the classic work 

NEWLY UPDATED AND REVISED 


The Art of 
Computer 
Programming 

volume 3 

Sorting and Searching 
Second Edition 


DONALD E. KNUTH 


Volume 1/ Fundamental Algorithms 
Third Edition (0-201-89683-4) 

This first volume begins with basic programming 
concepts and techniques, then focuses on 
information structures — the representation 
of information inside a computer, the structural 
relationships between data elements and how 
to deal with them efficiently. Elementary 
applications are given to simulation, numerical 
methods, symbolic computing, software and 
system design. 


Volume 2/ Seminumerical Algorithms 
Third Edition (0-201-89684-2) 

The second volume offers a complete 
introduction to the field of seminumerical 
algorithms, with separate chapters on random 
numbers and arithmetic. The book summarizes 
the major paradigms and basic theory of such 
algorithms, thereby providing a comprehensive 
interface between computer programming 
and numerical analysis. 


Volume 3/ Sorting and Searching 
Second Edition (0-201-89685-0) 

The third volume comprises the most 
comprehensive survey of classical computer 
techniques for sorting and searching. It extends 
the treatment of data structures in Volume I 
to consider both large and small databases and 
internal and external memories. 


Volume 4A/ Combinatorial Algorithms, 

Part 1 (0-201-03804-8) 

This volume introduces techniques that allow 
computers to deal efficiently with gigantic 
problems. Its coverage begins with Boolean 
functions and bitwise tricks and techniques, 
then treats in depth the generation of all 
tuples and permutations, all combinations 
and partitions, and all trees. 


















THE ART OF
COMPUTER PROGRAMMING

Volume 3 / Sorting and Searching

SECOND EDITION

DONALD E. KNUTH, Stanford University

ADDISON-WESLEY

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco
New York • Toronto • Montreal • London • Munich • Paris • Madrid
Capetown • Sydney • Tokyo • Singapore • Mexico City

TeX is a trademark of the American Mathematical Society
METAFONT is a trademark of Addison-Wesley

The author and publisher have taken care in the preparation of this book, but make no 
expressed or implied warranty of any kind and assume no responsibility for errors or 
omissions. No liability is assumed for incidental or consequential damages in connection 
with or arising out of the use of the information or programs contained herein. 

The publisher offers excellent discounts on this book when ordered in quantity for bulk 
purposes or special sales, which may include electronic versions and/or custom covers 
and content particular to your business, training goals, marketing focus, and branding 
interests. For more information, please contact: 

U.S. Corporate and Government Sales (800) 382-3419
corpsales@pearsontechgroup.com

For sales outside the U.S., please contact:

International Sales international@pearsoned.com

Visit us on the Web: informit.com/aw 


Library of Congress Cataloging-in-Publication Data
Knuth, Donald Ervin, 1938-

The art of computer programming / Donald Ervin Knuth.
xiv, 782 p. 24 cm.

Includes bibliographical references and index.

Contents: v. 1. Fundamental algorithms. — v. 2. Seminumerical
algorithms. — v. 3. Sorting and searching. — v. 4a. Combinatorial
algorithms, part 1.

Contents: v. 3. Sorting and searching. — 2nd ed.

ISBN 978-0-201-89683-1 (v. 1, 3rd ed.)

ISBN 978-0-201-89684-8 (v. 2, 3rd ed.)

ISBN 978-0-201-89685-5 (v. 3, 2nd ed.)

ISBN 978-0-201-03804-0 (v. 4a)

1. Electronic digital computers — Programming. 2. Computer
algorithms. I. Title.

QA76.6.K64 1997

005.1 — DC21


Internet page http://www-cs-faculty.stanford.edu/~knuth/taocp.html contains
current information about this book and related books.

Copyright © 1998 by Addison-Wesley

All rights reserved. Printed in the United States of America. This publication is
protected by copyright, and permission must be obtained from the publisher prior to
any prohibited reproduction, storage in a retrieval system, or transmission in any form
or by any means, electronic, mechanical, photocopying, recording, or likewise. For
information regarding permissions, write to:

Pearson Education, Inc. 

Rights and Contracts Department 
501 Boylston Street, Suite 900 
Boston, MA 02116 Fax: (617) 671-3447 
ISBN-13 978-0-201-89685-5 
ISBN-10 0-201-89685-0 

Text printed in the United States at Courier Westford in Westford, Massachusetts. 
Twenty-eighth printing, March 2011 


PREFACE 


Cookery is become an art, 
a noble science; 
cooks are gentlemen. 
— TITUS LIVIUS, Ab Urbe Condita XXXIX.vi
(Robert Burton, Anatomy of Melancholy 1.2.2.2)


This BOOK forms a natural sequel to the material on information structures in 
Chapter 2 of Volume 1, because it adds the concept of linearly ordered data to 
the other basic structural ideas. 

The title “Sorting and Searching” may sound as if this book is only for those
systems programmers who are concerned with the preparation of general-purpose
sorting routines or applications to information retrieval. But in fact the area of
sorting and searching provides an ideal framework for discussing a wide variety 
of important general issues: 

• How are good algorithms discovered? 

• How can given algorithms and programs be improved? 

• How can the efficiency of algorithms be analyzed mathematically? 

• How can a person choose rationally between different algorithms for the 
same task? 

• In what senses can algorithms be proved “best possible”? 

• How does the theory of computing interact with practical considerations? 

• How can external memories like tapes, drums, or disks be used efficiently 
with large databases? 

Indeed, I believe that virtually every important aspect of programming arises 
somewhere in the context of sorting or searching! 

This volume comprises Chapters 5 and 6 of the complete series. Chapter 5 
is concerned with sorting into order; this is a large subject that has been divided 
chiefly into two parts, internal sorting and external sorting. There also are 
supplementary sections, which develop auxiliary theories about permutations 
(Section 5.1) and about optimum techniques for sorting (Section 5.3). Chapter 6 
deals with the problem of searching for specified items in tables or files; this is 
subdivided into methods that search sequentially, or by comparison of keys, or 
by digital properties, or by hashing, and then the more difficult problem of 
secondary key retrieval is considered. There is a surprising amount of interplay
between both chapters, with strong analogies tying the topics together. Two
important varieties of information structures are also discussed, in addition to 
those considered in Chapter 2, namely priority queues (Section 5.2.3) and linear 
lists represented as balanced trees (Section 6.2.3). 

Like Volumes 1 and 2, this book includes a lot of material that does not 
appear in other publications. Many people have kindly written to me about 
their ideas, or spoken to me about them, and I hope that I have not distorted 
the material too badly when I have presented it in my own words. 

I have not had time to search the patent literature systematically; indeed, 
I decry the current tendency to seek patents on algorithms (see Section 5.4.5). 
If somebody sends me a copy of a relevant patent not presently cited in this 
book, I will dutifully refer to it in future editions. However, I want to encourage 
people to continue the centuries-old mathematical tradition of putting newly 
discovered algorithms into the public domain. There are better ways to earn a 
living than to prevent other people from making use of one’s contributions to 
computer science. 

Before I retired from teaching, I used this book as a text for a student’s 
second course in data structures, at the junior-to-graduate level, omitting most 
of the mathematical material. I also used the mathematical portions of this book 
as the basis for graduate-level courses in the analysis of algorithms, emphasizing 
especially Sections 5.1, 5.2.2, 6.3, and 6.4. A graduate-level course on concrete 
computational complexity could also be based on Sections 5.3, and 5.4.4, together 
with Sections 4.3.3, 4.6.3, and 4.6.4 of Volume 2. 

For the most part this book is self-contained, except for occasional discus- 
sions relating to the MIX computer explained in Volume 1. Appendix B contains a 
summary of the mathematical notations used, some of which are a little different 
from those found in traditional mathematics books. 


Preface to the Second Edition 

This new edition matches the third editions of Volumes 1 and 2, in which I have
been able to celebrate the completion of TeX and METAFONT by applying those
systems to the publications they were designed for.

The conversion to electronic format has given me the opportunity to go 
over every word of the text and every punctuation mark. I’ve tried to retain 
the youthful exuberance of my original sentences while perhaps adding some 
more mature judgment. Dozens of new exercises have been added; dozens of 
old exercises have been given new and improved answers. Changes appear 
everywhere, but most significantly in Sections 5.1.4 (about permutations and 
tableaux), 5.3 (about optimum sorting), 5.4.9 (about disk sorting), 6.2.2 (about 
entropy), 6.4 (about universal hashing), and 6.5 (about multidimensional trees 
and tries). 



The Art of Computer Programming is, however, still a work in progress.
Research on sorting and searching continues to grow at a phenomenal rate.
Therefore some parts of this book are headed by an “under construction” icon, 
to apologize for the fact that the material is not up-to-date. For example, if I 
were teaching an undergraduate class on data structures today, I would surely 
discuss randomized structures such as treaps at some length; but at present, I 
am only able to cite the principal papers on the subject, and to announce plans 
for a future Section 6.2.5 (see page 478). My files are bursting with important 
material that I plan to include in the final, glorious, third edition of Volume 3, 
perhaps 17 years from now. But I must finish Volumes 4 and 5 first, and I do 
not want to delay their publication any more than absolutely necessary. 

I am enormously grateful to the many hundreds of people who have helped 
me to gather and refine this material during the past 35 years. Most of the 
hard work of preparing the new edition was accomplished by Phyllis Winkler
(who put the text of the first edition into TeX form), by Silvio Levy (who
edited it extensively and helped to prepare several dozen illustrations), and by
Jeffrey Oldham (who converted more than 250 of the original illustrations to
METAPOST format). The production staff at Addison-Wesley has also been
extremely helpful, as usual. 

I have corrected every error that alert readers detected in the first edition — 
as well as some mistakes that, alas, nobody noticed — and I have tried to avoid 
introducing new errors in the new material. However, I suppose some defects still 
remain, and I want to fix them as soon as possible. Therefore I will cheerfully 
award $2.56 to the first finder of each technical, typographical, or historical error. 
The webpage cited on page iv contains a current listing of all corrections that 
have been reported to me. 

Stanford, California D. E. K. 

February 1998 


There are certain common Privileges of a Writer, 
the Benefit whereof, I hope, there will be no Reason to doubt; 
Particularly, that where I am not understood, it shall be concluded, 
that something very useful and profound is coucht underneath. 

— JONATHAN SWIFT, Tale of a Tub, Preface (1704) 



NOTES ON THE EXERCISES 


The EXERCISES in this set of books have been designed for self-study as well 
as for classroom study. It is difficult, if not impossible, for anyone to learn a 
subject purely by reading about it, without applying the information to specific 
problems and thereby being encouraged to think about what has been read. 
Furthermore, we all learn best the things that we have discovered for ourselves. 
Therefore the exercises form a major part of this work; a definite attempt has 
been made to keep them as informative as possible and to select problems that 
are enjoyable as well as instructive. 

In many books, easy exercises are found mixed randomly among extremely 
difficult ones. A motley mixture is, however, often unfortunate because readers 
like to know in advance how long a problem ought to take — otherwise they 
may just skip over all the problems. A classic example of such a situation is 
the book Dynamic Programming by Richard Bellman; this is an important, 
pioneering work in which a group of problems is collected together at the end 
of some chapters under the heading “Exercises and Research Problems,” with 
extremely trivial questions appearing in the midst of deep, unsolved problems. 
It is rumored that someone once asked Dr. Bellman how to tell the exercises 
apart from the research problems, and he replied, “If you can solve it, it is an 
exercise; otherwise it’s a research problem.” 

Good arguments can be made for including both research problems and 
very easy exercises in a book of this kind; therefore, to save the reader from 
the possible dilemma of determining which are which, rating numbers have been 
provided to indicate the level of difficulty. These numbers have the following 
general significance: 

Rating Interpretation 

00 An extremely easy exercise that can be answered immediately if the 
material of the text has been understood; such an exercise can almost 
always be worked “in your head.” 

10 A simple problem that makes you think over the material just read, but 
is by no means difficult. You should be able to do this in one minute at 
most; pencil and paper may be useful in obtaining the solution. 

20 An average problem that tests basic understanding of the text mate- 
rial, but you may need about fifteen or twenty minutes to answer it 
completely. 




30 A problem of moderate difficulty and/or complexity; this one may 
involve more than two hours’ work to solve satisfactorily, or even more 
if the TV is on. 

40 Quite a difficult or lengthy problem that would be suitable for a term 
project in classroom situations. A student should be able to solve the 
problem in a reasonable amount of time, but the solution is not trivial. 

50 A research problem that has not yet been solved satisfactorily, as far 
as the author knew at the time of writing, although many people have 
tried. If you have found an answer to such a problem, you ought to 
write it up for publication; furthermore, the author of this book would 
appreciate hearing about the solution as soon as possible (provided that 
it is correct). 

By interpolation in this “logarithmic” scale, the significance of other rating
numbers becomes clear. For example, a rating of 17 would indicate an exercise
that is a bit simpler than average. Problems with a rating of 50 that are
subsequently solved by some reader may appear with a 45 rating in later editions
of the book, and in the errata posted on the Internet (see page iv).

The remainder of a rating number modulo 5 indicates the amount of detailed
work required. Thus, an exercise rated 24 may take longer to solve than an
exercise that is rated 25, but the latter will require more creativity.

The author has tried earnestly to assign accurate rating numbers, but it is 
difficult for the person who makes up a problem to know just how formidable it 
will be for someone else to find a solution; and everyone has more aptitude for 
certain types of problems than for others. It is hoped that the rating numbers 
represent a good guess at the level of difficulty, but they should be taken as 
general guidelines, not as absolute indicators. 

This book has been written for readers with varying degrees of mathematical 
training and sophistication; as a result, some of the exercises are intended only for 
the use of more mathematically inclined readers. The rating is preceded by an M 
if the exercise involves mathematical concepts or motivation to a greater extent 
than necessary for someone who is primarily interested only in programming 
the algorithms themselves. An exercise is marked with the letters “HM” if its 
solution necessarily involves a knowledge of calculus or other higher mathematics 
not developed in this book. An “HM” designation does not necessarily imply
difficulty. 

Some exercises are preceded by an arrowhead, “▶”; this designates prob-
lems that are especially instructive and especially recommended. Of course, no 
reader/student is expected to work all of the exercises, so those that seem to 
be the most valuable have been singled out. (This distinction is not meant to 
detract from the other exercises!) Each reader should at least make an attempt 
to solve all of the problems whose rating is 10 or less; and the arrows may help 
to indicate which of the problems with a higher rating should be given priority. 

Solutions to most of the exercises appear in the answer section. Please use 
them wisely; do not turn to the answer until you have made a genuine effort to
solve the problem by yourself, or unless you absolutely do not have time to work
this particular problem. After getting your own solution or giving the problem a 
decent try, you may find the answer instructive and helpful. The solution given 
will often be quite short, and it will sketch the details under the assumption 
that you have earnestly tried to solve it by your own means first. Sometimes the 
solution gives less information than was asked; often it gives more. It is quite 
possible that you may have a better answer than the one published here, or you 
may have found an error in the published solution; in such a case, the author 
will be pleased to know the details. Later printings of this book will give the 
improved solutions together with the solver’s name where appropriate. 

When working an exercise you may generally use the answers to previous 
exercises, unless specifically forbidden from doing so. The rating numbers have 
been assigned with this in mind; thus it is possible for exercise n + 1 to have a 
lower rating than exercise n, even though it includes the result of exercise n as 
a special case. 


Summary of codes:

00  Immediate
10  Simple (one minute)
20  Medium (quarter hour)
30  Moderately hard
40  Term project
50  Research problem

►   Recommended
M   Mathematically oriented
HM  Requiring “higher math”


EXERCISES 

► 1. [00] What does the rating “M20” mean?

2. [10] Of what value can the exercises in a textbook be to the reader? 

3. [HM45] Prove that when n is an integer, n > 2, the equation x^n + y^n = z^n has
no solution in positive integers x, y, z.


Two hours' daily exercise . . . will be enough 
to keep a hack fit for his work. 
— M. H. MAHON, The Handy Horse Book (1865) 



CONTENTS

Chapter 5 — Sorting   1

*5.1. Combinatorial Properties of Permutations   11
*5.1.1. Inversions   11
*5.1.2. Permutations of a Multiset   22
*5.1.3. Runs   35
*5.1.4. Tableaux and Involutions   47
5.2. Internal Sorting   72
5.2.1. Sorting by Insertion   80
5.2.2. Sorting by Exchanging   105
5.2.3. Sorting by Selection   138
5.2.4. Sorting by Merging   158
5.2.5. Sorting by Distribution   168
5.3. Optimum Sorting   180
5.3.1. Minimum-Comparison Sorting   180
*5.3.2. Minimum-Comparison Merging   197
*5.3.3. Minimum-Comparison Selection   207
*5.3.4. Networks for Sorting   219
5.4. External Sorting   248
5.4.1. Multiway Merging and Replacement Selection   252
*5.4.2. The Polyphase Merge   267
*5.4.3. The Cascade Merge   288
*5.4.4. Reading Tape Backwards   299
*5.4.5. The Oscillating Sort   311
*5.4.6. Practical Considerations for Tape Merging   317
*5.4.7. External Radix Sorting   343
*5.4.8. Two-Tape Sorting   348
*5.4.9. Disks and Drums   356
5.5. Summary, History, and Bibliography   380

Chapter 6 — Searching   392

6.1. Sequential Searching   396
6.2. Searching by Comparison of Keys   409
6.2.1. Searching an Ordered Table   409
6.2.2. Binary Tree Searching   426
6.2.3. Balanced Trees   458
6.2.4. Multiway Trees   481
6.3. Digital Searching   492
6.4. Hashing   513
6.5. Retrieval on Secondary Keys   559

Answers to Exercises   584

Appendix A — Tables of Numerical Quantities   748

1. Fundamental Constants (decimal)   748
2. Fundamental Constants (octal)   749
3. Harmonic Numbers, Bernoulli Numbers, Fibonacci Numbers   750

Appendix B — Index to Notations   752

Appendix C — Index to Algorithms and Theorems   757

Index and Glossary   759



CHAPTER FIVE 


SORTING 


There is nothing more difficult to take in hand, 
more perilous to conduct, or more uncertain in its success, 
than to take the lead in the introduction of 
a new order of things. 

— NICCOLO MACHIAVELLI, The Prince (1513) 

" But you can 't took up all those license 
numbers in time," Drake objected. 

"We don’t have to, Paul. We merely arrange a list 
and look for duplications." 

— PERRY MASON, in The Case of the Angry Mourner (1951) 

"Treesort" Computer — With this new 'computer-approach' 
to nature study you can quickly identify over 260 
different trees of U.S., Alaska, and Canada, 
even palms, desert trees, and other exotics. 

To sort, you simply insert the needle. 

— EDMUND SCIENTIFIC COMPANY, Catalog (1964) 

In THIS CHAPTER we shall study a topic that arises frequently in programming: 
the rearrangement of items into ascending or descending order. Imagine how 
hard it would be to use a dictionary if its words were not alphabetized! We 
will see that, in a similar way, the order in which items are stored in computer 
memory often has a profound influence on the speed and simplicity of algorithms 
that manipulate those items. 

Although dictionaries of the English language define “sorting” as the process 
of separating or arranging things according to class or kind, computer program- 
mers traditionally use the word in the much more special sense of marshaling 
things into ascending or descending order. The process should perhaps be called 
ordering, not sorting; but anyone who tries to call it “ordering” is soon led
into confusion because of the many different meanings attached to that word. 
Consider the following sentence, for example: “Since only two of our tape drives 
were in working order, I was ordered to order more tape units in short order, 
in order to order the data several orders of magnitude faster.” Mathematical 
terminology abounds with still more senses of order (the order of a group, the 
order of a permutation, the order of a branch point, relations of order, etc., etc.). 
Thus we find that the word “order” can lead to chaos. 

Some people have suggested that “sequencing” would be the best name for
the process of sorting into order; but this word often seems to lack the right
connotation, especially when equal elements are present, and it occasionally
conflicts with other terminology. It is quite true that “sorting” is itself an 
overused word (“I was sort of out of sorts after sorting that sort of data”), 
but it has become firmly established in computing parlance. Therefore we shall 
use the word “sorting” chiefly in the strict sense of sorting into order, without 
further apologies. 

Some of the most important applications of sorting are: 

a) Solving the “togetherness” problem, in which all items with the same identi-
fication are brought together. Suppose that we have 10000 items in arbitrary
order, many of which have equal values; and suppose that we want to rearrange
the data so that all items with equal values appear in consecutive positions. This
is essentially the problem of sorting in the older sense of the word; and it can be
solved easily by sorting the file in the new sense of the word, so that the values
are in ascending order, v_1 ≤ v_2 ≤ ··· ≤ v_{10000} (see the sketch following this
list). The efficiency achievable in this procedure explains why the original
meaning of “sorting” has changed.

b) Matching items in two or more files. If several files have been sorted into the 
same order, it is possible to find all of the matching entries in one sequential pass 
through them, without backing up. This is the principle that Perry Mason used 
to help solve a murder case (see the quotation at the beginning of this chapter). 
We can usually process a list of information most quickly by traversing it in 
sequence from beginning to end, instead of skipping around at random in the 
list, unless the entire list is small enough to fit in a high-speed random-access 
memory. Sorting makes it possible to use sequential accessing on large files, as 
a feasible substitute for direct addressing. 

c) Searching for information by key values. Sorting is also an aid to searching, 
as we shall see in Chapter 6, hence it helps us make computer output more 
suitable for human consumption. In fact, a listing that has been sorted into 
alphabetic order often looks quite authoritative even when the associated nu- 
merical information has been incorrectly computed. 
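To make application (a) concrete, here is a minimal sketch in Python (the function name and sample data are invented for illustration; this is not from the book): sort the file so that equal values become adjacent, then collect each run of equal values in one sequential scan.

    # A minimal sketch of application (a): after sorting, items with equal
    # values occupy consecutive positions, so one left-to-right scan groups them.
    def group_equal(items):
        items = sorted(items)            # now v_1 <= v_2 <= ... <= v_N
        runs, run = [], []
        for x in items:
            if run and x != run[-1]:
                runs.append(run)         # value changed: close the current run
                run = []
            run.append(x)
        if run:
            runs.append(run)
        return runs

    print(group_equal([3, 1, 3, 2, 1]))  # [[1, 1], [2], [3, 3]]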

Although sorting has traditionally been used mostly for business data pro- 
cessing, it is actually a basic tool that every programmer should keep in mind 
for use in a wide variety of situations. We have discussed its use for simplify- 
ing algebraic formulas, in exercise 2.3.2-17. The exercises below illustrate the 
diversity of typical applications. 

One of the first large-scale software systems to demonstrate the versatility 
of sorting was the LARC Scientific Compiler developed by J. Erdwinn, D. E. 
Ferguson, and their associates at Computer Sciences Corporation in 1960. This 
optimizing compiler for an extended FORTRAN language made heavy use of 
sorting so that the various compilation algorithms were presented with relevant 
parts of the source program in a convenient sequence. The first pass was a 
lexical scan that divided the FORTRAN source code into individual tokens, each 
representing an identifier or a constant or an operator, etc. Each token was 
assigned several sequence numbers; when sorted on the name and an appropriate 
sequence number, all the uses of a given identifier were brought together. The
“defining entries” by which a user would specify whether an identifier stood for a
function name, a parameter, or a dimensioned variable were given low sequence 
numbers, so that they would appear first among the tokens having a given 
identifier; this made it easy to check for conflicting usage and to allocate storage 
with respect to EQUIVALENCE declarations. The information thus gathered about 
each identifier was now attached to each token; in this way no “symbol table” 
of identifiers needed to be maintained in the high-speed memory. The updated 
tokens were then sorted on another sequence number, which essentially brought 
the source program back into its original order except that the numbering scheme 
was cleverly designed to put arithmetic expressions into a more convenient 
“Polish prefix” form. Sorting was also used in later phases of compilation, to 
facilitate loop optimization, to merge error messages into the listing, etc. In 
short, the compiler was designed so that virtually all the processing could be 
done sequentially from files that were stored in an auxiliary drum memory, since
appropriate sequence numbers were attached to the data in such a way that it 
could be sorted into various convenient arrangements. 

Computer manufacturers of the 1960s estimated that more than 25 percent 
of the running time on their computers was spent on sorting, when all their 
customers were taken into account. In fact, there were many installations in 
which the task of sorting was responsible for more than half of the computing 
time. From these statistics we may conclude that either (i) there are many 
important applications of sorting, or (ii) many people sort when they shouldn’t, 
or (iii) inefficient sorting algorithms have been in common use. The real truth 
probably involves all three of these possibilities, but in any event we can see that 
sorting is worthy of serious study, as a practical matter. 

Even if sorting were almost useless, there would be plenty of rewarding rea- 
sons for studying it anyway! The ingenious algorithms that have been discovered 
show that sorting is an extremely interesting topic to explore in its own right. 
Many fascinating unsolved problems remain in this area, as well as quite a few 
solved ones. 

From a broader perspective we will find also that sorting algorithms make a 
valuable case study of how to attack computer programming problems in general. 
Many important principles of data structure manipulation will be illustrated in 
this chapter. We will be examining the evolution of various sorting techniques 
in an attempt to indicate how the ideas were discovered in the first place. By 
extrapolating this case study we can learn a good deal about strategies that help 
us design good algorithms for other computer problems. 

Sorting techniques also provide excellent illustrations of the general ideas 
involved in the analysis of algorithms — the ideas used to determine performance 
characteristics of algorithms so that an intelligent choice can be made between 
competing methods. Readers who are mathematically inclined will find quite a 
few instructive techniques in this chapter for estimating the speed of computer 
algorithms and for solving complicated recurrence relations. On the other hand, 
the material has been arranged so that readers without a mathematical bent can 
safely skip over these calculations. 




Before going on, we ought to define our problem a little more clearly, and
introduce some terminology. We are given N items

R_1, R_2, ..., R_N

to be sorted; we shall call them records, and the entire collection of N records
will be called a file. Each record R_j has a key, K_j, which governs the sorting
process. Additional data, besides the key, is usually also present; this extra
“satellite information” has no effect on sorting except that it must be carried
along as part of each record.

An ordering relation “<” is specified on the keys so that the following
conditions are satisfied for any key values a, b, c:

i) Exactly one of the possibilities a < b, a = b, b < a is true. (This is called
the law of trichotomy.)

ii) If a < b and b < c, then a < c. (This is the familiar law of transitivity.)

Properties (i) and (ii) characterize the mathematical concept of linear ordering,
also called total ordering. Files whose keys satisfy these two properties can be
sorted by most of the methods to be mentioned in this chapter, although some
sorting techniques are designed to work only with numerical or alphabetic
keys that have the usual ordering.

The goal of sorting is to determine a permutation p(1) p(2) ... p(N) of the
indices {1, 2, ..., N} that will put the keys into nondecreasing order:

K_{p(1)} ≤ K_{p(2)} ≤ ··· ≤ K_{p(N)}.    (1)

The sorting is called stable if we make the further requirement that records with
equal keys should retain their original relative order. In other words, stable
sorting has the additional property that

p(i) < p(j)    whenever K_{p(i)} = K_{p(j)} and i < j.    (2)

In some cases we will want the records to be physically rearranged in storage 
so that their keys are in order. But in other cases it will be sufficient merely to 
have an auxiliary table that specifies the permutation in some way, so that the 
records can be accessed in order of their keys. 
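The auxiliary-table idea is easy to demonstrate. Here is a small Python sketch (the variable names are mine, not the book's): build the permutation p instead of moving the records, relying on the fact that Python's sorted() is stable, so equal keys automatically satisfy condition (2).

    # A sketch of sorting via an auxiliary permutation table: the records stay
    # put, and p lists their indices in order of their keys.  Python's sorted()
    # is stable, so records with equal keys keep their original relative order,
    # which is exactly condition (2).
    records = [('dave', 3), ('alice', 1), ('carol', 3), ('bob', 2)]

    p = sorted(range(len(records)), key=lambda j: records[j][1])
    print(p)                           # [1, 3, 0, 2]: 'dave' still precedes 'carol'
    print([records[j] for j in p])     # access the records in order of their keys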

A few of the sorting methods in this chapter assume the existence of either
or both of the values “∞” and “−∞”, which are defined to be greater than and
less than all keys, respectively:

−∞ < K_j < ∞,    for 1 ≤ j ≤ N.    (3)

Such extreme values are occasionally used as artificial keys or as sentinel indica-
tors. The case of equality is excluded in (3); if equality can occur, the algorithms
can be modified so that they will still work, but usually at the expense of some
elegance and efficiency.

Sorting can be classified generally into internal sorting, in which the records 
are kept entirely in the computer’s high-speed random-access memory, and ex- 
ternal sorting, when more records are present than can be held comfortably in
memory at once. Internal sorting allows more flexibility in the structuring and
accessing of the data, while external sorting shows us how to live with rather 
stringent accessing constraints. 

The time required to sort N records, using a decent general-purpose sorting
algorithm, is roughly proportional to N log N; we make about log N “passes”
over the data. This is the minimum possible time, as we shall see in Section 5.3.1,
if the records are in random order and if sorting is done by pairwise comparisons
of keys. Thus if we double the number of records, it will take a little more
than twice as long to sort them, all other things being equal. (Actually, as N
approaches infinity, a better indication of the time needed to sort is N(log N)^2,
if the keys are distinct, since the size of the keys must grow at least as fast as
log N; but for practical purposes, N never really approaches infinity.)

On the other hand, if the keys are known to be randomly distributed with 
respect to some continuous numerical distribution, we will see that sorting can 
be accomplished in O(N) steps on the average. 
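The O(N) average-time claim can be illustrated with a bucket-distribution sketch, anticipating the methods of Section 5.2.5. The code and its names are mine, under the stated assumption that keys are uniformly distributed in [0, 1):

    import random

    # A sketch of O(N) average-time sorting, assuming keys uniform in [0, 1):
    # each of the N buckets receives about one key on the average, so the
    # small per-bucket sorts cost O(N) in total.
    def bucket_sort(keys):
        n = len(keys)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[int(k * n)].append(k)   # bucket index is monotone in the key
        out = []
        for b in buckets:
            b.sort()                        # buckets are tiny on the average
            out.extend(b)
        return out

    keys = [random.random() for _ in range(1000)]
    assert bucket_sort(keys) == sorted(keys)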

EXERCISES — First Set 

1. [M20] Prove, from the laws of trichotomy and transitivity, that the permutation
p(1) p(2) ... p(N) is uniquely determined when the sorting is assumed to be stable.

2. [21] Assume that each record R_j in a certain file contains two keys, a “major key”
K_j and a “minor key” k_j, with a linear ordering < defined on each of the sets of keys.
Then we can define lexicographic order between pairs of keys (K, k) in the usual way:

(K_i, k_i) < (K_j, k_j)    if K_i < K_j or if K_i = K_j and k_i < k_j.

Alice took this file and sorted it first on the major keys, obtaining n groups of
records with equal major keys in each group,

K_{p(1)} = ··· = K_{p(i_1)} < K_{p(i_1+1)} = ··· = K_{p(i_2)} < ··· < K_{p(i_{n−1}+1)} = ··· = K_{p(i_n)},

where i_n = N. Then she sorted each of the n groups R_{p(i_{j−1}+1)}, ..., R_{p(i_j)} on their
minor keys.

Bill took the same original file and sorted it first on the minor keys; then he took 
the resulting file, and sorted it on the major keys. 

Chris took the same original file and did a single sorting operation on it, using 
lexicographic order on the major and minor keys (Kj, kj). 

Did everyone obtain the same result? 

3. [M25] Let < be a relation on K_1, ..., K_N that satisfies the law of trichotomy but
not the transitive law. Prove that even without the transitive law it is possible to sort
the records in a stable manner, meeting conditions (1) and (2); in fact, there are at
least three arrangements that satisfy the conditions!

► 4. [21] Lexicographers don’t actually use strict lexicographic order in dictionaries, 
because uppercase and lowercase letters must be interfiled. Thus they want an ordering 
such as this: 

a < A < aa < AA < AAA < Aachen < aah < • • • < zzz < ZZZ. 

Explain how to implement dictionary order. 


6 


SORTING 


5 



► 5. [M28] Design a binary code for all nonnegative integers so that if n is encoded as
the string p(n) we have m < n if and only if p(m) is lexicographically less than p(n).
Moreover, p(m) should not be a prefix of p(n) for any m ≠ n. If possible, the length of
p(n) should be at most lg n + O(log log n) for all large n. (Such a code is useful if we
want to sort texts that mix words and numbers, or if we want to map arbitrarily large
alphabets into binary strings.)

6. [15] Mr. B. C. Dull (a MIX programmer) wanted to know if the number stored in
location A is greater than, less than, or equal to the number stored in location B. So
he wrote ‘LDA A; SUB B’ and tested whether register A was positive, negative, or zero.
What serious mistake did he make, and what should he have done instead?

7. [17] Write a MIX subroutine for multiprecision comparison of keys, having the
following specifications:

Calling sequence: JMP COMPARE

Entry conditions: rI1 = n; CONTENTS(A + k) = a_k and CONTENTS(B + k) = b_k, for
1 ≤ k ≤ n; assume that n ≥ 1.

Exit conditions: CI = GREATER, if (a_n, ..., a_1) > (b_n, ..., b_1);
CI = EQUAL, if (a_n, ..., a_1) = (b_n, ..., b_1);
CI = LESS, if (a_n, ..., a_1) < (b_n, ..., b_1);
rX and rI1 are possibly affected.

Here the relation (a_n, ..., a_1) < (b_n, ..., b_1) denotes lexicographic ordering from left to
right; that is, there is an index j such that a_k = b_k for n ≥ k > j, but a_j < b_j.

► 8. [30] Locations A and B contain two numbers a and b, respectively. Show that it is 
possible to write a MIX program that computes and stores min(a, b) in location C, without 
using any jump operators. (Caution: Since you will not be able to test whether or not 
arithmetic overflow has occurred, it is wise to guarantee that overflow is impossible 
regardless of the values of a and b.) 

9. [M27] After N independent, uniformly distributed random variables between 0
and 1 have been sorted into nondecreasing order, what is the probability that the rth
smallest of these numbers is ≤ x?

EXERCISES — Second Set 

Each of the following exercises states a problem that a computer programmer might 
have had to solve in the old days when computers didn’t have much random-access 
memory. Suggest a “good” way to solve the problem, assuming that only a few thousand 
words of internal memory are available, supplemented by about half a dozen tape units 
(enough tape units for sorting). Algorithms that work well under such limitations also 
prove to be efficient on modern machines. 

10. [15] You are given a tape containing one million words of data. How do you 
determine how many distinct words are present on the tape? 

11. [18] You are the U.S. Internal Revenue Service; you receive millions of “informa-
tion” forms from organizations telling how much income they have paid to people, and
millions of tax forms from people telling how much income they have been paid. How
do you catch people who don’t report all of their income?

12. [M25] (Transposing a matrix.) You are given a magnetic tape containing one
million words, representing the elements of a 1000 × 1000 matrix stored in order by rows:
a_{1,1} a_{1,2} ... a_{1,1000} a_{2,1} ... a_{2,1000} ... a_{1000,1000}. How do you create a tape in which the
elements are stored by columns a_{1,1} a_{2,1} ... a_{1000,1} a_{1,2} ... a_{1000,2} ... a_{1000,1000} instead?
(Try to make fewer than a dozen passes over the data.)

13. [M26] How could you “shuffle” a large file of N words into a random rearrange- 
ment? 

14. [20] You are working with two computer systems that have different conventions 
for the “collating sequence” that defines the ordering of alphameric characters. How do 
you make one computer sort alphameric files in the order used by the other computer? 

15. [18] You are given a list of the names of a fairly large number of people born in
the U.S.A., together with the name of the state where they were born. How do you 
count the number of people born in each state? (Assume that nobody appears in the 
list more than once.) 

16. [20] In order to make it easier to make changes to large FORTRAN programs, you 
want to design a “cross-reference” routine; such a routine takes FORTRAN programs 
as input and prints them together with an index that shows each use of each identifier 
(that is, each name) in the program. How should such a routine be designed? 

► 17. [33] (Library card sorting.) Before the days of computerized databases, every
library maintained a catalog of cards so that users could find the books they wanted. 
But the task of putting catalog cards into an order convenient for human use turned out 
to be quite complicated as library collections grew. The following “alphabetical” listing 
indicates many of the procedures recommended in the American Library Association 
Rules for Filing Catalog Cards (Chicago: 1942): 


Text of card                                   Remarks

R. Accademia nazionale dei Lincei, Rome        Ignore foreign royalty (except British)
1812; ein historischer Roman.                  Achtzehnhundertzwölf
Bibliothèque d’histoire révolutionnaire.       Treat apostrophe as space in French
Bibliothèque des curiosités.                   Ignore accents on letters
Brown, Mrs. J. Crosby                          Ignore designation of rank
Brown, John                                    Names with dates follow those without
Brown, John, mathematician                     ... and the latter are subarranged
Brown, John, of Boston                           by descriptive words
Brown, John, 1715-1766                         Arrange identical names by birthdate
BROWN, JOHN, 1715-1766                         Works “about” follow works “by”
Brown, John, d. 1811                           Sometimes birthdate must be estimated
Brown, Dr. John, 1810-1882                     Ignore designation of rank
Brown-Williams, Reginald Makepeace             Treat hyphen as space
Brown America.                                 Book titles follow compound names
Brown & Dallison’s Nevada directory.           & in English becomes “and”
Brownjohn, Alan
Den’, Vladimir Eduardovich, 1867-              Ignore apostrophe in names
The den.                                       Ignore an initial article
Den lieben langen Tag.                         ... provided it’s in nominative case
Dix, Morgan, 1827-1908                         Names precede words
1812 ouverture.                                Dix-huit cent douze
Le XIXe siècle français.                       Dix-neuvième
The 1847 issue of U. S. stamps.                Eighteen forty-seven
1812 overture.                                 Eighteen twelve
I am a mathematician.                          (a book by Norbert Wiener)
IBM journal of research and development.       Initials are like one-letter words
ha-I ha-ehad.                                  Ignore initial article
Ia; a love story.                              Ignore punctuation in titles
International Business Machines Corporation
al-Khuwārizmī, Muḥammad ibn Mūsā,              Ignore initial “al-” in Arabic names
  fl. 813-846
Labour. A magazine for all workers.            Respell it “Labor”
Labor research association
Labour, see Labor                              Cross-reference card
McCall’s cookbook                              Ignore apostrophe in English
McCarthy, John, 1927-                          Mc = Mac
Machine-independent computer                   Treat hyphen as space
  programming.
MacMahon, Maj. Percy Alexander,                Ignore designation of rank
  1854-1929
Mrs. Dalloway.                                 “Mrs.” = “Mistress”
Mistress of mistresses.
Royal society of London                        Don’t ignore British royalty
St. Petersburger Zeitung.                      “St.” = “Saint”, even in German
Saint-Saëns, Camille, 1835-1921                Treat hyphen as space
Ste-Marie, Gaston P                            Sainte
Seminumerical algorithms.                      (a book by Donald Ervin Knuth)
Uncle Tom’s cabin.                             (a book by Harriet Beecher Stowe)
U. S. bureau of the census.                    “U. S.” = “United States”
Vandermonde, Alexandre Théophile,
  1735-1796
Van Valkenburg, Mac Elwyn, 1921-               Ignore space after prefix in surnames
Von Neumann, John, 1903-1957
The whole art of legerdemain.                  Ignore initial article
Who’s afraid of Virginia Woolf?                Ignore apostrophe in English
Wijngaarden, Adriaan van, 1916-                Surname begins with uppercase letter

(Most of these rules are subject to certain exceptions, and there are many other
rules not illustrated here.)

If you were given the job of sorting large quantities of catalog cards by computer, 
and eventually maintaining a very large file of such cards, and if you had no chance to 
change these long-standing policies of card filing, how would you arrange the data in 
such a way that the sorting and merging operations are facilitated? 

18. [M25] (E. T. Parker.) Leonhard Euler once conjectured [Nova Acta Acad. Sci.
Petropolitanae 13 (1795), 45-63, §3; written in 1778] that there are no solutions to the
equation

u^6 + v^6 + w^6 + x^6 + y^6 = z^6

in positive integers u, v, w, x, y, z. At the same time he conjectured that

x_1^n + ··· + x_{n−1}^n = x_n^n

would have no positive integer solutions, for all n ≥ 3, but this more general conjecture
was disproved by the computer-discovered identity 27^5 + 84^5 + 110^5 + 133^5 = 144^5;
see L. J. Lander, T. R. Parkin, and J. L. Selfridge, Math. Comp. 21 (1967), 446-459.


5 


SORTING 


9 


Infinitely many counterexamples when n = 4 were subsequently found by Noam Elkies
[Math. Comp. 51 (1988), 825-835]. Can you think of a way in which sorting would
help in the search for counterexamples to Euler’s conjecture when n = 6?

► 19. [24] Given a file containing a million or so distinct 30-bit binary words x_1, ..., x_N,
what is a good way to find all complementary pairs {x_i, x_j} that are present? (Two
words are complementary when one has 0 wherever the other has 1, and conversely;
thus they are complementary if and only if their sum is (11...1)_2, when they are
treated as binary numbers.)

► 20. [25] Given a file containing 1000 30-bit words x_1, ..., x_{1000}, how would you pre-
pare a list of all pairs (x_i, x_j) such that x_i = x_j except in at most two bit positions?

21. [22] How would you go about looking for five-letter anagrams such as CARET, 
CARTE, CATER, CRATE, REACT, RECTA, TRACE; CRUEL, LUCRE, ULCER; DOWRY, ROWDY, WORDY? 
[One might wish to know whether there are any sets of ten or more five-letter English 
anagrams besides the remarkable set 

APERS, ASPER, PARES, PARSE, PEARS, PRASE, PRESA, RAPES, REAPS, SPAER, SPARE, SPEAR, 

to which we might add the French word APRES.] 

22. [M28] Given the specifications of a fairly large number of directed graphs, what 
approach will be useful for grouping the isomorphic ones together? (Directed graphs are 
isomorphic if there is a one-to-one correspondence between their vertices and a one-to- 
one correspondence between their arcs, where the correspondences preserve incidence 
between vertices and arcs.) 

23. [30] In a certain group of 4096 people, everyone has about 100 acquaintances. 
A file has been prepared listing all pairs of people who are acquaintances. (The relation 
is symmetric: If x is acquainted with y, then y is acquainted with x. Therefore the file 
contains roughly 200,000 entries.) How would you design an algorithm to list all the
k-person cliques in this group of people, given k? (A clique is an instance of mutual
acquaintances: Everyone in the clique is acquainted with everyone else.) Assume that 
there are no cliques of size 25, so the total number of cliques cannot be enormous. 

► 24. [30] Three million men with distinct names were laid end-to-end, reaching from 
New York to California. Each participant was given a slip of paper on which he wrote 
down his own name and the name of the person immediately west of him in the line. 
The man at the extreme western end didn’t understand what to do, so he threw his 
paper away; the remaining 2,999,999 slips of paper were put into a huge basket and 
taken to the National Archives in Washington, D.C. Here the contents of the basket 
were shuffled completely and transferred to magnetic tapes. 

At this point an information scientist observed that there was enough information
on the tapes to reconstruct the list of people in their original order. And a computer
scientist discovered a way to do the reconstruction with fewer than 1000 passes through
the data tapes, using only sequential accessing of tape files and a small amount of
random-access memory. How was that possible?

[In other words, given the pairs (x_i, x_{i+1}), for 1 ≤ i < N, in random order,
where the x_i are distinct, how can the sequence x_1 x_2 ... x_N be obtained, restricting
all operations to serial techniques suitable for use with magnetic tapes? This is the
problem of sorting into order when there is no easy way to tell which of two given keys
precedes the other; we have already raised this question as part of exercise 2.2.3-25.]




25. [M21] (Discrete logarithms.) You know that p is a (rather large) prime number,
and that a is a primitive root modulo p. Therefore, for all b in the range 1 ≤ b < p,
there is a unique n such that a^n mod p = b, 1 ≤ n < p. (This n is called the index
of b modulo p, with respect to a.) Explain how to find n, given b, without needing
O(n) steps. [Hint: Let m = ⌈√p⌉ and try to solve a^{mn_1} ≡ ba^{−n_2} (modulo p) for
0 ≤ n_1, n_2 < m.]




*5.1. COMBINATORIAL PROPERTIES OF PERMUTATIONS 

A PERMUTATION of a finite set is an arrangement of its elements into a row. 
Permutations are of special importance in the study of sorting algorithms, since 
they represent the unsorted input data. In order to study the efficiency of 
different sorting methods, we will want to be able to count the number of 
permutations that cause a certain step of a sorting procedure to be executed 
a certain number of times. 

We have, of course, met permutations frequently in previous chapters. For
example, in Section 1.2.5 we discussed two basic theoretical methods of con-
structing the n! permutations of n objects; in Section 1.3.3 we analyzed some
algorithms dealing with the cycle structure and multiplicative properties of
permutations; in Section 3.3.2 we studied their “runs up” and “runs down.”
The purpose of the present section is to study several other properties of per- 
mutations, and to consider the general case where equal elements are allowed to 
appear. In the course of this study we will learn a good deal about combinatorial 
mathematics. 

The properties of permutations are sufficiently pleasing to be interesting in 
their own right, and it is convenient to develop them systematically in one place 
instead of scattering the material throughout this chapter. But readers who 
are not mathematically inclined and readers who are anxious to dive right into 
sorting techniques are advised to go on to Section 5.2 immediately, since the 
present section actually has little direct connection to sorting. 


*5.1.1. Inversions 


Let a_1 a_2 ... a_n be a permutation of the set {1, 2, ..., n}. If i < j and a_i > a_j,
the pair (a_i, a_j) is called an inversion of the permutation; for example, the
permutation 3 1 4 2 has three inversions: (3, 1), (3, 2), and (4, 2). Each inversion is
a pair of elements that is out of sort, so the only permutation with no inversions is
the sorted permutation 1 2 ... n. This connection with sorting is the chief reason
why we will be so interested in inversions, although we have already used the
concept to analyze a dynamic storage allocation algorithm (see exercise 2.2.2-9).
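The definition translates directly into code; this little Python sketch (mine, not from the book) lists the inversions of 3 1 4 2 exactly as above:

    # Every pair (a_i, a_j) with i < j and a_i > a_j is an inversion.
    def inversions(a):
        return [(a[i], a[j])
                for i in range(len(a))
                for j in range(i + 1, len(a))
                if a[i] > a[j]]

    print(inversions([3, 1, 4, 2]))   # [(3, 1), (3, 2), (4, 2)]
    print(inversions([1, 2, 3, 4]))   # []: the sorted permutation has none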

The concept of inversions was introduced by G. Cramer in 1750 [Intr. à
l’Analyse des Lignes Courbes Algébriques (Geneva: 1750), 657-659; see Thomas
Muir, Theory of Determinants 1 (1906), 11-14], in connection with his famous
rule for solving linear equations. In essence, Cramer defined the determinant of
an n × n matrix in the following way:

∑ (−1)^{inv(a_1 a_2 ... a_n)} x_{1a_1} x_{2a_2} ... x_{na_n},

summed over all permutations a_1 a_2 ... a_n of {1, 2, ..., n}, where inv(a_1 a_2 ... a_n)
is the number of inversions of the permutation.

The inversion table b_1 b_2 ... b_n of the permutation a_1 a_2 ... a_n is obtained by
letting b_j be the number of elements to the left of j that are greater than j.
In other words, b_j is the number of inversions whose second component is j.
It follows, for example, that the permutation

5 9 1 8 2 6 4 7 3    (1)

has the inversion table

2 3 6 4 0 2 2 1 0,    (2)

since 5 and 9 are to the left of 1; 5, 9, 8 are to the left of 2; etc. This permutation
has 20 inversions in all. By definition the numbers b_j will always satisfy

0 ≤ b_1 ≤ n−1,  0 ≤ b_2 ≤ n−2,  ...,  0 ≤ b_{n−1} ≤ 1,  b_n = 0.    (3)

Perhaps the most important fact about inversions is the simple observation
that an inversion table uniquely determines the corresponding permutation. We
can go back from any inversion table b_1 b_2 ... b_n satisfying (3) to the unique
permutation that produces it, by successively determining the relative placement
of the elements n, n−1, ..., 1 (in this order). For example, we can construct the
permutation corresponding to (2) as follows: Write down the number 9; then
place 8 after 9, since b_8 = 1. Similarly, put 7 after both 8 and 9, since b_7 = 2.
Then 6 must follow two of the numbers already written down, because b_6 = 2;
the partial result so far is therefore

9 8 6 7.

Continue by placing 5 at the left, since b_5 = 0; put 4 after four of the numbers;
and put 3 after six numbers (namely at the extreme right), giving

5 9 8 6 4 7 3.

The insertion of 2 and 1 in an analogous way yields (1).
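Both directions of this correspondence are short enough to program. The following Python sketch (function names mine) computes the inversion table of (1) and then rebuilds the permutation from it by the insertion procedure just described:

    def inversion_table(a):
        """b[j-1] = number of elements to the left of j that are greater than j."""
        b = []
        for j in range(1, len(a) + 1):
            pos = a.index(j)
            b.append(sum(1 for x in a[:pos] if x > j))
        return b

    def from_inversion_table(b):
        """Place n, n-1, ..., 1 in turn; element j goes after b[j-1] of them."""
        out = []
        for j in range(len(b), 0, -1):
            out.insert(b[j - 1], j)
        return out

    a = [5, 9, 1, 8, 2, 6, 4, 7, 3]
    b = inversion_table(a)
    print(b)                             # [2, 3, 6, 4, 0, 2, 2, 1, 0], as in (2)
    assert sum(b) == 20                  # the total number of inversions
    assert from_inversion_table(b) == a  # the table determines the permutation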

This correspondence is important because we can often translate a problem
stated in terms of permutations into an equivalent problem stated in terms of
inversion tables, and the latter problem may be easier to solve. For example,
consider the simplest question of all: How many permutations of {1, 2, ..., n} are
possible? The answer must be the number of possible inversion tables, and they
are easily enumerated since there are n choices for b_1, independently n−1 choices
for b_2, ..., 1 choice for b_n, making n(n−1)...1 = n! choices in all. Inversion
tables are easy to count, because the b’s are completely independent of each
other, while the a’s must be mutually distinct.

In Section 1.2.10 we analyzed the number of local maxima that occur when
a permutation is read from right to left; in other words, we counted how many
elements are larger than any of their successors. (The right-to-left maxima in (1),
for example, are 3, 7, 8, and 9.) This is the number of j such that b_j has its
maximum value, n − j. Since b_1 will equal n − 1 with probability 1/n, and
(independently) b_2 will be equal to n − 2 with probability 1/(n−1), etc., it is
clear by consideration of the inversions that the average number of right-to-left
maxima is

1/n + 1/(n−1) + ··· + 1/1 = H_n.

The corresponding generating function is also easily derived in a similar way.

Fig. 1. The truncated octahedron, which shows the change in inversions when adjacent
elements of a permutation are interchanged.

If we interchange two adjacent elements of a permutation, it is easy to see
that the total number of inversions will increase or decrease by unity. Figure 1
shows the 24 permutations of {1, 2, 3, 4}, with lines joining permutations that
differ by an interchange of adjacent elements; following any line downward inverts
exactly one new pair. Hence the number of inversions of a permutation π is the
length of a downward path from 1 2 3 4 to π in Fig. 1; all such paths must have
the same length.

Incidentally, the diagram in Fig. 1 may be viewed as a three-dimensional
solid, the “truncated octahedron,” which has 8 hexagonal faces and 6 square
faces. This is one of the classical uniform polyhedra attributed to Archimedes
(see exercise 10).

The reader should not confuse inversions of a permutation with the inverse
of a permutation. Recall that we can write a permutation in two-line form

( 1   2   3  ···  n  )
( a_1 a_2 a_3 ··· a_n );    (4)

the inverse a'_1 a'_2 a'_3 ... a'_n of this permutation is the permutation obtained by
interchanging the two rows and then sorting the columns into increasing order
of the new top row:

( a_1 a_2 a_3 ··· a_n )   ( 1   2   3  ···  n  )
( 1   2   3  ···  n  ) = ( a'_1 a'_2 a'_3 ··· a'_n ).

For example, the inverse of 5 9 1 8 2 6 4 7 3 is 3 5 9 7 1 6 8 4 2, since

( 5 9 1 8 2 6 4 7 3 )   ( 1 2 3 4 5 6 7 8 9 )
( 1 2 3 4 5 6 7 8 9 ) = ( 3 5 9 7 1 6 8 4 2 ).    (5)

Another way to define the inverse is to say that a'_j = k if and only if a_k = j.
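That last definition gives a one-pass way to compute the inverse; a small Python sketch (mine):

    # a'[j] = k if and only if a[k] = j (1-based values stored in 0-based lists).
    def inverse(a):
        inv = [0] * len(a)
        for k, j in enumerate(a, start=1):   # a_k = j, so ...
            inv[j - 1] = k                   # ... a'_j = k
        return inv

    a = [5, 9, 1, 8, 2, 6, 4, 7, 3]
    print(inverse(a))                 # [3, 5, 9, 7, 1, 6, 8, 4, 2]
    assert inverse(inverse(a)) == a   # inverting twice restores the permutation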

The inverse of a permutation was first defined by H. A. Rothe [in Samm-
lung combinatorisch-analytischer Abhandlungen, edited by C. F. Hindenburg, 2
(Leipzig: 1800), 263-305], who noticed an interesting connection between inverses
and inversions: The inverse of a permutation has exactly as many inversions as
the permutation itself. Rothe’s proof of this fact was not the simplest possible
one, but it is instructive and quite pretty nevertheless. We construct an n × n
chessboard having a dot in column j of row i whenever a_i = j. Then we put
×’s in all squares that have dots lying both below (in the same column) and to
their right (in the same row). For example, the diagram for 5 9 1 8 2 6 4 7 3 is


n 

°n 

since 


(5) 


X 

X 

X 

X 

• 





X 

X 

X 

X 


X 

X 

X 

• 

• 










X 

X 

X 


X 

X 

• 



• 










X 

X 


• 






X 

• 








X 




• 





• 








The number of x ’s is the number of inversions, since it is easy to see that bj is the 
number of x’s in column j. Now if we transpose the diagram — interchanging 
rows and columns we get the diagram corresponding to the inverse of the 
original permutation. Hence the number of x ’s (the number of inversions) is 
the same in both cases. Rothe used this fact to prove that the determinant of a 
matrix is unchanged when the matrix is transposed. 

The analysis of several sorting algorithms involves the knowledge of how 
many permutations of n elements have exactly k inversions. Let us denote that 
number by I n (k)'. Table 1 lists the first few values of this function. 

By considering the inversion table bx b 2 . . . b n , it is obvious that /„( 0) = 1, 
/„(!) = n — 1, and there is a symmetry property 


I n({ n 2 )~k)= I n (k). 


(6) 



5.1.1 


INVERSIONS 


15 


Table 1 

PERMUTATIONS WITH k INVERSIONS 


n 

In( 0) 

Ml) 

In{ 2) 

In (3) 

In( 4) 

/»(5) 

In( 6) 

In( 7) 

/n(8) 

In( 9) 

In (10) 

/n(ll) 

1 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

2 

1 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

3 

1 

2 

2 

] 1 

0 

0 

0 

0 

0 

0 

0 

0 

4 

1 

3 

5 

6 

5 

3 

1 

0 

0 

0 

0 

0 

5 

1 

4 

9 

15 

20 

22 

20 

15 

9 

4 

1 

0 

6 

1 

5 

14 

29 

49 

71 

90 

101 

101 

90 

71 

49 


Furthermore, since each of the b’s can be chosen independently of the others, it 
is not difficult to see that the generating function 

G n (z) = In (0) + I n (X)z + I n (2)z 2 + ■ ■ ■ (7) 

satisfies G n (z) = (1 + z + • • • + z n ~ 1 ) G n _i(z); hence it has the comparatively 
simple form noticed by O. Rodrigues [J. de Math. 4 (1839), 236-240]: 

(1 + 2 + • • • + 2"- 1 ) . . . (1 + 2)(1) = (1 - 2") ... (1 - 2 2 )(1 - Z )/(1 - 2)”. (8) 

From this generating function, we can easily extend Table 1, and we can verify 
that the numbers below the zigzag line in that table satisfy 

I n (k) — I n (k - 1) + J„_i(fc), for k < n. (9) 

(This relation does not hold above the zigzag line.) A more complicated argu- 
ment (see exercise 14) shows that, in fact, we have the formulas 


M2)- (2) !. 



n> 2; 

)• 

n > 3; 

V)- 

n > 4; 

3 ) +1 ’ 

n > 5; 


in general, the formula for I n (k) contains about l.Qy/k terms: 


In(k) = 


f n+k — 2 

V k 


!n+k- 3\ fn+k- 6\ /n+fc— 8\ 

V k - 2 ) + V k - 5 ) + V k-7 J 


+ (_!)!( IN + / n+k-Uj-j - 1 

k Uj J ( k Uj j 


n> k, (10) 


where uj — (3 j 2 — j)/ 2 is a so-called “pentagonal number.” 

If we divide G n (z) by n! we get the generating function g n (z) for the 
probability distribution of the number of inversions in a random permutation 


16 


SORTING 


5.1.1 


of n elements. This is the product 


ffn(z) = hi(z)h 2 (z) . ..h n (z), 


(11) 


where hk(z) = (1 + z + 1 - z k 1 )/k is the generating function for the uniform 

distribution of a random nonnegative integer less than k. It follows that 

mean(g„) = mean(hi) + mean (h 2 ) H 1- mean(/i„) 

1 , | n — 1 _ n(n — 1) 

2 '" + 2 “ 4 

var(g n ) = var(hi) + var(h 2 ) H h var (h n ) 

1 

4 + ’ ' + 12 72 


= 0 


= 0 


+ 


— 1 _ n(2n + 5)(n — 1) 


(12) 


(! 3 ) 


So the average number of inversions is rather large, about |n 2 ; the standard 
deviation is also rather large, about | n 3 / 2 . 

A remarkable discovery about the distribution of inversions was made by 
P. A. MacMahon [Amer. J. Math. 35 (1913), 281-322], Let us define the index 
of the permutation ai a 2 . . . a n as the sum of all subscripts j such that a.j > a 3 +\ , 
1 < j < n. For example, the index of 591826473 is 2 + 4 + 6 + 8 = 20. By 
coincidence the index is the same as the number of inversions in this case. If we 
list the 24 permutations of {1,2, 3, 4}, namely 


Permutation 

Index 

Inversions 

Permutation 

Index 

Inversions 

12 3 4 

0 

0 

3 1 1 2 4 

1 

2 

1 2 4|3 

3 

1 

3|1 4|2 

4 

3 

1 3|2 4 

2 

1 

3|2|1 4 

3 

3 

1 3 4|2 

3 

2 

3 j 2 4|1 

4 

4 

1 4|2 3 

2 

2 

3 4|1 2 

2 

4 

1 4|3|2 

5 

3 

3 4|2|1 

5 

5 

2 1 1 3 4 

1 

1 

4|1 2 3 

1 

3 

2 1 4|3 

4 

2 

4|1 3|2 

4 

4 

2 3 1 1 4 

2 

2 

4|2|1 3 

3 

4 

2 3 4|1 

3 

3 

4|2 3|1 

4 

5 

2 4|1 3 

2 

3 

4|3|1 2 

3 

5 

2 4|3|1 

5 

4 

4|3|2|1 

6 

6 


we see that the number of permutations having a given index, k, is the same as 
the number having k inversions. 

At first this fact might appear to be almost obvious, but further scrutiny 
makes it very mysterious. MacMahon gave an ingenious indirect proof, as follows: 
Let ind(ai a 2 . . . a„) be the index of the permutation a\ a 2 . . . a n , and let 


H n {z) = J 2 zindiaia2 ' an) (14) 

be the corresponding generating function; the sum in (14) is over all permutations 
of {1, 2, . . . , n}. We wish to show that H n (z) — G n (z). For this purpose we will 


5.1.1 


INVERSIONS 


17 


define a one-to-one correspondence between arbitrary n-tuples (qi, q 2 , . . . , q n ) of 
nonnegative integers, on the one hand, and ordered pairs of n-tuples 

(( a l, a 2) • • ■ , a n), (Pl,P2 ,---,Pn)) 

on the other hand, where a\ a 2 . . . a n is a permutation of the indices {1, 2 , . . . , n} 
and Pi > Vi > • • • > Pn > 0. This correspondence will satisfy the condition 

9i + 92 H Vq n - ind(ai a 2 . . ■ a n ) + (pi + p 2 H h p„). (15) 

The generating function z9l+92+ ’ +9n ) summed over all n-tuples of nonnega- 
tive integers (qi,q 2 , ■ ■ ■ , q n ), is Q n (z) = 1/(1 — z) n ; and the generating function 
£ z pi+P 2 +-+p» summ ed over all n-tuples of integers (pi,p 2 , • • • ,Pn) such that 
Pi > P2 > ' ' ' > Pn > 0, is 

P n (z) = 1/(1 - Z)( 1 - Z 2 ) ... (1 - z n ), (16) 

as shown in exercise 15. In view of (15), the one-to-one correspondence we are 
about to establish will prove that Q n (z ) = H n (z)P n (z), that is, 

H n (z) = Q n {z)/P n {z). (17) 

But Q n {z)/P n (z ) is G n (z), by (8). 

The desired correspondence is defined by a simple sorting procedure: Any 
n-tuple (qi, q 2 , . . . ,q n ) can be rearranged into nonincreasing order q ai > q a , 2 > 
^ Qa n in a stable manner, where a,i a 2 . . . a n is a permutation such that q a . = 
q aj+1 implies a,j < a j+ We set (pi,P2, . . . ,p„) = (q ai , q a2 , ■ ■ ■ , q an ) and then, for 
1 <j<n, subtract 1 from each of pi , . . . , p 3 for each j such that a 3 > a j+1 . We 
still have Pi > p 2 > • • • > p„, because pj was strictly greater than p J+ i whenever 
a i > a i+ 1 - The resulting pair ((ai, a 2 , . . . , a„), (pi,p 2 , . . . ,p„)) satisfies (15), 
because the total reduction of the p’s is ind(ai a 2 . . . a„). For example, if n = 9 
and (q!,...,q 9 ) = (3, 1, 4, 1, 5, 9, 2, 6, 5), we find ai...o 9 = 685931724 and 
(pi,...,p 9 ) = (5, 2, 2, 2, 2, 2, 1,1,1). 

Conversely, we can easily go back to (q lt q 2 , . . . , q n ) when 01 a 2 . . . a n and 
(PiiP 2) • • • 1 Pn) are given. (See exercise 17.) So the desired correspondence has 
been established, and MacMahon’s index theorem has been proved. 

D. Foata and M. P. Schiitzenberger discovered a surprising extension of 
MacMahon’s theorem, about 65 years after MacMahon’s original publication: 
The number of permutations of n elements that have k inversions and index l is 
the same as the number that have l inversions and index k. In fact, Foata and 
Schiitzenberger found a simple one-to-one correspondence between permutations 
of the first kind and permutations of the second (see exercise 25). 

EXERCISES 

1. [10] What is the inversion table for the permutation 27184593 6? What per- 
mutation has the inversion table 50121200? 

2. [M20] In the classical problem of Josephus (exercise 1. 3.2—22), n men are initially 
arranged in a circle; the mth man is executed, the circle closes, and every mth man is 
repeatedly eliminated until all are dead. The resulting execution order is a permutation 


18 


SORTING 


5.1.1 


°f {1,2 For example, when n = 8 and m = 4 the order is 54613872 (man 1 
is 5th out, etc.); the inversion table corresponding to this permutation is 36310010 
Give a simple recurrence relation for the elements b 1 b 2 ...b n of the inversion table 
m the general Josephus problem for n men, when every mth man is executed. 

3. [18] If the permutation a 1 a 2 ...a n corresponds to the inversion table bi b 2 . . . b n 
what is the permutation oi a 2 ...o„ that corresponds to the inversion table 


(n - 1 - &i)(r 


2 — bo 


■ (0 — b n ) ? 


► 4. [20] Design an algorithm suitable for computer implementation that constructs 
the permutation a 2 ...a n corresponding to a given inversion table b x b 2 . . . b n satis- 
fying (3)- [Hint: Consider a linked-memory technique.] 

5. [35] The algorithm of exercise 4 requires an execution time roughly proportional 
to n + + • • • + 6 n on typical computers, and this is ©(n 2 ) on the average. Is there an 

algorithm whose worst-case running time is substantially better than order ra 2 ? 

► 6. [26] Design an algorithm that computes the inversion table bib 2 . . . b n correspond- 
ing to a given permutation a,a 2 ...a n of {l,2,...,n}, where the running time is 
essentially proportional to n log n on typical computers. 

7. [20] Several other kinds of inversion tables can be defined, corresponding to a 
given permutation «i a 2 . ,.a n of {1, 2, ... , n}, besides the particular table b 2 ...b n 

ehned in the text; in this exercise we will consider three other types of inversion tables 
that arise in applications. 

Let Cj be the number of inversions whose first component is j, that is, the number 
of elements to the rtght of j that are less than j. [Corresponding to (i) we have the 
table 0 0 014 215 7; clearly 0 < e, < j.] Let Bj = b aj and C, = c 0 . 

Show that 0 < Bj < j and 0 < Cj < n - j, for 1 < j < n; furthermore show 
hat the permutation aia 2 ...a n can be determined uniquely when either cic 2 ...c 
or Bi B 2 . . . B n or C\ C 2 . . . C n is given. 

8. [M2 J t ] Continuing the notation of exercise 7, let a\ a' 2 . . . a' n be the inverse of 
he permutation a x a 2 .. . a n , and let the corresponding inversion tables be b\ b' 2 . . . b' n . 

Ci c 2 • • c„, B 1 B 2 ...B n , and C, C' 2 . . . C' n . Find as many interesting relations as you 
can between the numbers a,-, b h C] , Bj, Cj, a'j, b), c', B), C'j. 

► 9. [MSI] Prove that, in the notation of exercise 7, the permutation a 1 a 2 ...a„ is an 
involution (that is, its own inverse) if and only if bj = Cj for 1 < j < n. 

10. [HM20] Consider Fig. 1 as a polyhedron in three dimensions. What is the diam- 

eter of the truncated octahedron (the distance between vertex 1234 and vertex 4321) 
if all of its edges have unit length? ' 

11. [M25] If 7 t = ai a 2 ... a n is a permutation of {1,2,..., n}, let 

■®'( 7r ) — {( a ii a j) | * < J, > dj} 

be the set of its inversions, and let 


-®( 7r ) {( a ii a j) | i~> j, Hi ~> dj} 

be the non-inversions. 

a) Prove that E(n) and E(n) are transitive. (A set S of ordered pairs is called 
transitive if (o,c) is in S whenever both (a, b) and (b, c) are in S.) 


5.1.1 


INVERSIONS 


19 


b) Conversely, let E be any transitive subset of T = {(a;,?/) | 1 < y < x < n} whose 
complement E = T\E is also transitive. Prove that there exists a permutation n 
such that E(n) = E. 

12. [M28] Continuing the notation of the previous exercise, prove that if 7Ti and 7T2 
are permutations and if E is the smallest transitive set containing E(ni) U E(iV 2 ), then 
E is transitive. [Hence, if we say m is “above” 7 t 2 whenever E( 7Ti) C E( 7 r 2 ), a lattice 
of permutations is defined; there is a unique “lowest” permutation “above” two given 
permutations. Figure 1 is the lattice diagram when n = 4.] 

13. [M23] It is well known that half of the terms in the expansion of a determinant 
have a plus sign, and half have a minus sign. In other words, there are just as many 
permutations with an even number of inversions as with an odd number, when n > 2. 
Show that, in general, the number of permutations having a number of inversions 
congruent to t modulo m is n!/m, regardless of the integer t. whenever n > m. 

14. [M24] (F. Franklin.) A partition of n into k distinct parts is a representation 
n = Pi + P 2 + • ■ ■ + Pk, where pi > p 2 > • • • > Pk > 0. For example, the partitions of 7 
into distinct parts are 7, 6 + 1, 5 + 2, 4 + 3, 4 + 2 + 1. Let f k {n) be the number of 
partitions of n into k distinct parts; prove that Y.k(~ l ) k fk( n ) = 0, unless n has the 
form (3 j 2 ±j)/ 2, for some nonnegative integer j; in the latter case the sum is (-1)+ 
For example, when n = 7 the sum is -1 + 3-1 = 1, and 7 = (3 • 2 2 + 2)/2. [Hint: 
Represent a partition as an array of dots, putting p t dots in the ith row, for 1 < i < k. 
Find the smallest j such that p 3 +i < pj — 1, and encircle the rightmost dots in the first 
j rows. If j < pk, these j dots can usually be removed, tilted 45°, and placed as a new 
(fc+l)st row. On the other hand if j > p k , the fcth row of dots can usually be removed, 
tilted 45 , and placed to the right of the circled dots. (See Fig. 2.) This process pairs 
off partitions having an odd number of rows with partitions having an even number of 
rows, in most cases, so only unpaired partitions must be considered in the sum.] 



Fig. 2. Franklin’s correspondence between partitions with distinct parts. 


Note: As a consequence, we obtain Euler’s formula 

(1 — z )(l — z 2 )(l — z 3 ) . . . = 1 — z — z 2 + z 5 + z 7 — z 12 — 2 15 + • • • 

— oo<j<oo 

The generating function for ordinary partitions (whose parts are not necessarily dis- 
tinct) is ^2,p(n)z n — 1/(1 — z)( 1 — 2 2 )(1 — z 3 ) . . . ; hence we obtain a nonobvious 
recurrence relation for the partition numbers, 

p(n) = p{n - 1) + p(n - 2) - p(n - 5) - p(n - 7) + p(n - 12) + p(n - 15) . 


20 


SORTING 


5.1.1 


15. [MSS] Prove that (16) is the generating function for partitions into at most n 
parts; that is, prove that the coefficient of z m in l/(l-z)(l-z 2 )...(l- z n ) is the 
number of ways to write m = pi + p 2 + • • • + p„ with pi > p 2 > • • • > p„ > 0. 
[Hint: Drawing dots as in exercise 14, show that there is a one-to-one correspondence 
between n-tuples (pi . p 2 ..... p n ) such that Pi > p 2 > ■ ■ • > p n > 0 and sequences 
(Pi, P2, P3, ■ ■ ■) such that n > Pi > P 2 > P3 >■■■ > 0, with the property that 

Pi + P2 H \-Pn = P\ + P2 + Ps-\ . In other words, partitions into at most n parts 

correspond to partitions into parts not exceeding n.] 

16. [M25] (L. Euler.) Prove the following identities by interpreting both sides of the 
equations in terms of partitions: 

TT I = l 

t > L 0 (! - qkz ) (!-*)(! -qz)(l-q 2 z)... 


1 + — - — + u ... 

1 -q (l-q)(l-q 2 ) 


n> 0 / k=l 


n (! + q k z) = (1 + z)(l + qz)( 1 + qz ) . . . 

fc >0 

= 1+ — h ^ + ... 

1-q (l-q)(l-q2) + 




17. [20] In MacMahon’s correspondence defined at the end of this section, what are 
the 24 quadruples (91, 92, q 3 , qt) for which (p\,P2,Pz,Pi) = (0,0, 0,0)? 

18. [M30] (T. Hibbard, CACM 6 (1963), 210.) Let n > 0, and assume that a sequence 
of 2 n n-bit integers Xq, . . . , X2 n ~i has been generated at random, where each bit of 
each number is independently equal to 1 with probability p. Consider the sequence 
Xo ® 0, Xi © 1, . . . , X 2 n-i © (2 n — 1), where © denotes the “exclusive or” operation 

on the binary representations. Thus if p = 0, the sequence is 0, 1, , 2” — 1, and if 

P — 1 if is 2" — 1, . . . , 1, 0; and when p = | , each element of the sequence is a random 
integer between 0 and 2 n — 1. For general p this is a useful way to generate a sequence 
of random integers with a biased number of inversions, although the distribution of 
the elements of the sequence taken as a whole is uniform in the sense that each n-bit 
integer has the same distribution. What is the average number of inversions in such a 
sequence, as a function of the probability p? 

19. [M28] (C. Meyer.) When m is relatively prime to n, we know that the sequence 
(m mod n)(2m mod n) . . . ((n — l)m mod n) is a permutation of (1, 2, . . . , n — 1}. Show 
that the number of inversions of this permutation can be expressed in terms of Dedekind 
sums (see Section 3.3.3). 

20. [M43] The following famous identity due to Jacobi [ Fundaments Nova Theorise 
Functionum Ellipticarum (1829), §64] is the basis of many remarkable relationships 
involving elliptic functions: 


H(1 - «V _1 )(1 - u*-V)(l - u k v k ) 
k>1 

— (1 — u)(l — u)(l — uv)( 1 — u v)(l — uv 2 )(l — u 2 v 2 ) . . . 
= 1 — {u + v) + ( u 3 v + uv 3 ) — (u 6 v 3 + u 3 v 6 ) + • ■ • 

- E (-i)^U^ 1 ). 

— oo<j<+oo 


5 . 1.1 


INVERSIONS 


21 


For example, if we set u — z, v = z 2 , we obtain Euler’s formula of exercise 14. If we 
set z = \/u/v, q = y/uv, we obtain 

n(l-g 2fc " 1 ^)(l-q 2A; - 1 ^ 1 )(l-q 2 ' t )= £ (-1 Tz n q n \ 

k>l — oo<n<oo 

Is there a combinatorial proof of Jacobi’s identity, analogous to Franklin’s proof 
of the special case in exercise 14? (Thus we want to consider “complex partitions” 

m + ni = (pi + q\i) + (p 2 + 92 *) H 1- {pk + qki ) 

where the pj + qji are distinct nonzero complex numbers, pj and Qj being nonnegative 
integers with \pj — qj \ < 1. Jacobi’s identity says that the number of such represen- 
tations with k even is the same as the number with k odd, except when m and n 
are consecutive triangular numbers.) What other remarkable properties do complex 
partitions have? 

► 21. [ M25 ] (G. D. Knott.) Show that the permutation a\...a„ is obtainable with 
a stack, in the sense of exercise 2.2. 1-5 or 2.3. 1-6, if and only if Cj < Cj+i + 1 for 
1 < j < n in the notation of exercise 7. 

22. [ M26 ] Given a permutation ai o 2 . . . a n of {1,2,..., n}, let hj be the number of 
indices i < j such that a t 6 {aj + 1, a.j + 2 , . . . , a J+ i }. (If a ]+ \ < aj , the elements of this 
set “wrap around” from ntol. When j = n we use the set {a„+l, a„+2, . . . , n}.) For 
example, the permutation 591826473 leads to hi ... hg = 0 0 1 2 1 4 2 4 6. 

a) Prove that ai o 2 . . . a n can be reconstructed from the numbers hi h 2 . . . h n . 

b) Prove that hi + h 2 + • • • + h n is the index of oi a 2 . . . a„. 

► 23. [ M27 ] ( Russian roulette.) A group of n condemned men who prefer probability 
theory to number theory might choose to commit suicide by sitting in a circle and 
modifying Josephus’s method (exercise 2) as follows: The first prisoner holds a gun 
and aims it at his head; with probability p he dies and leaves the circle. Then the 
second man takes the gun and proceeds in the same way. Play continues cyclically, 
with constant probability p > 0, until everyone is dead. 

Let a,j = k if man k is the jth to die. Prove that the death order ai o 2 . . .a„ 
occurs with a probability that is a function only of n, p, and the index of the dual 
permutation (n + 1 — a n ) . . . (n + 1 — a 2 ) (n + 1 — ai). What death order is least likely? 

24. [M26] Given integers f(l) t(2) . . . t(n) with t(j) > j, the generalized index of a 
permutation ai a 2 . . . a n is the sum of all subscripts j such that aj > t(aj+ 1 ), plus the 
total number of inversions such that i < j and t(aj) > Oj > aj. Thus when t(j) = j for 
all j, the generalized index is the same as the index; but when t(j) > n for all j it is the 
number of inversions. Prove that the number of permutations whose generalized index 
equals k is the same as the number of permutations having k inversions. [Hint: Show 
that, if we take any permutation ai . . . a n -i of {1, . . . , n — 1} and insert the number n 
in all possible places, we increase the generalized index by the numbers {0, 1, . . . , n — 1} 
in some order.] 

► 25. [M30] (Foata and Schiitzenberger.) If a = ai . . .a n is a permutation, let ind(a) 
be its index, and let inv(a) count its inversions. 

a) Define a one-to-one correspondence that takes each permutation a of {1, . . . ,n} 
to a permutation /(a) that has the following two properties: (i) ind(/(a)) = 
inv(a); (ii) for 1 < j < n, the number j appears to the left of j + 1 in f(a) 
if and only if it appears to the left of j + 1 in a. What permutation does your 


22 


SORTING 


5.1.1 


construction assign to f(a) when a = 19826374 5? For what permutation a is 
f(a) = 198263745? [Hint: If n > 1, write a = xiaix 2 a 2 ...XkOtka n , where 
Xi, . . . , x k are all the elements < a n if a\ < a n , otherwise x\, . . . , Xk are all the 
elements > a n ; the other elements appear in (possibly empty) strings ai, . . . , a fc . 
Compare the number of inversions of h(a) = a x xia 2 X 2 ■ . ■ ctkXk to inv(a); in this 
construction the number a„ does not appear in h(a).] 
b) Use / to define another one-to-one correspondence g having the following two 
properties: (i) ind(g(a)) = inv(a); (ii) inv(g(a)) = ind(a). [Hint: Consider 
inverse permutations.] 

26. [M25] What is the statistical correlation coefficient between the number of inver- 
sions and the index of a random permutation? (See Eq. 3 . 3 . 2 -( 24 ).) 

27. [ M37 ] Prove that, in addition to ( 15 ), there is a simple relationship between 
inv(oi 02 . . . a n ) and the n-tuple ( 91 , 92 , • • • , 9 n)- Use this fact to generalize the deriva- 
tion of ( 17 ), obtaining an algebraic characterization of the bivariate generating function 

H„(w,z) = J2w inV{ai “2 • a n) ;z ind(a 1 a 2 ...a n ) , 

where the sum is over all n! permutations a x a 2 ■ ■ ■ a n - 

► 28. [25] If aia 2 ...a„ is a permutation of {1,2, ...,n}, its total displacement is 
defined to be 1 a J ~ j\- Find upper and lower bounds for total displacement 
in terms of the number of inversions. 

29. [28] If 7 r = a\a 2 ... a„ and n' = a[a 2 . . . a' n are permutations of {1,2,..., n}, 
their product 7 T 7 r' is a' ai a'„ 2 . . . a' an . Let inv( 7 r) denote the number of inversions, as in 
exercise 25. Show that inv( 7 T 7 r') < inv( 7 r) -t-inv(Tr'), and that equality holds if and only 
if 7 T 7 r' is “below” k' in the sense of exercise 12 . 

*5.1.2. Permutations of a Multiset 

So far we have been discussing permutations of a set of elements; this is just a 
special case of the concept of permutations of a multiset. (A multiset is like a set 
except that it can have repetitions of identical elements. Some basic properties 
of multisets have been discussed in exercise 4.6.3-19.) 

For example, consider the multiset 

M = {a, a, a, b , b, c, d, d, d, d}, ( 1 ) 

which contains 3 a’s, 2 b's, 1 c, and 4 d’s. We may also indicate the multiplicities 
of elements in another way, namely 

M = {3 • a, 2 • b, c, 4 • d}. ( 2 ) 

A permutation* of M is an arrangement of its elements into a row; for example, 

cabddabdad. 

From another point of view we would call this a string of letters, containing 3 a’s, 
2 b's, 1 c, and 4 d’s. 

How many permutations of M are possible? If we regarded the elements 
of M as distinct, by subscripting them a x , a 2 , a 3 , b x , b 2 , ci, d x , d 2 , d 3 , d 4 , 


Sometimes called a “permatution. : 


5.1.2 


PERMUTATIONS OF A MULTISET 


23 


we would have 10! = 3,628,800 permutations; but many of those permutations 
would actually be the same when we removed the subscripts. In fact, each 
permutation of M would occur exactly 3! 2! 1! 4! = 288 times, since we can start 
with any permutation of M and put subscripts on the a’s in 3! ways, on the 
fe’s (independently) in 2! ways, on the c in 1 way, and on the d's in 4! ways. 
Therefore the true number of permutations of M is 


10 ! 

3! 2! 1! 4! 


12,600. 


In general, we can see by this same argument that the number of permutations 
of any multiset is the multinomial coefficient 


= ^T7.’ (3 > 

where n x is the number of elements of one kind, n 2 is the number of another 
kind, etc., and n = n i + n 2 + • • • is the total number of elements. 

The number of permutations of a set has been known for more than 1500 
years. The Hebrew Book of Creation (c. A.D. 400), which was the earliest literary 
product of Jewish philosophical mysticism, gives the correct values of the first 
seven factorials, after which it says “Go on and compute what the mouth cannot 
express and the ear cannot hear.” [Sefer Yetzirah, end of Chapter 4. See Solomon 
Gandz, Studies in Hebrew Astronomy and Mathematics (New York: Ktav, 1970), 
494-496; Aryeh Kaplan, Sefer Yetzirah (York Beach, Maine: Samuel Weiser, 
1993).] This is one of the first two known enumerations of permutations in 
history. The other occurs in the Indian classic Anuyogadvarasutra (c. 500), rule 
97, which gives the formula 


6x5x4x3x2xl-2 

for the number of permutations of six elements that are neither in ascending nor 
descending order. [See G. Chakravarti, Bull. Calcutta Math. Soc. 24 (1932), 
79-88. The Anuyogadvarasutra is one of the books in the canon of Jainism, 
a religious sect that flourishes in India.] 

The corresponding formula for permutations of multisets seems to have 
appeared first in the Lilavati of Bhaskara (c. 1150), sections 270-271. Bhaskara 
stated the rule rather tersely, and illustrated it only with two simple examples 
{2, 2, 1, 1} and {4, 8, 5, 5, 5}. Consequently the English translations of his work 
do not all state the rule correctly, although there is little doubt that Bhaskara 
knew what he was talking about. He went on to give the interesting formula 

(4 + 8 + 5 + 5 + 5) x 120 x 11111 
5x6 

for the sum of the 20 numbers 48555 + 45855 + • • • . 

The correct rule for counting permutations when elements are repeated was 
apparently unknown in Europe until Marin Mersenne stated it without proof 
as Proposition 10 in his elaborate treatise on melodic principles [ Harmonie 
Universelle 2, also entitled Traitez de la Voix et des Chants (1636), 129-130]. 



24 


SORTING 


5.1.2 


Mersenne was interested in the number of tunes that could be made from a given 
collection of notes; he observed, for example, that a theme by Boesset, 



can be rearranged in exactly 15!/(4! 3! 3! 2!) = 756,756,000 ways. 

The general rule (3) also appeared in Jean Prestet’s Elemens des Mathema- 
tiques (Paris: 1675), 351-352, one of the very first expositions of combinatorial 
mathematics to be written in the Western world. Prestet stated the rule correctly 
for a general multiset, but illustrated it only in the simple case {a, a, 6, b, c, c}. 
A few years later, John Wallis’s Discourse of Combinations (Oxford: 1685), 
Chapter 2 (published with his Treatise of Algebra ) gave a clearer and somewhat 
more detailed discussion of the rule. 

In 1965, Dominique Foata introduced an ingenious idea called the “inter- 
calation product,” which makes it possible to extend many of the known results 
about ordinary permutations to the general case of multiset permutations. [See 
Publ. Inst. Statistique, Univ. Paris, 14 (1965), 81-241; also Lecture Notes in 
Math. 85 (Springer, 1969).] Assuming that the elements of a multiset have been 
linearly ordered in some way, we may consider a two-line notation such as 

( a a a b b c d d d d\ 

\c a b d d a b d a d J 7 ^) 

where the top line contains the elements of M sorted into nondecreasing order 
and the bottom line is the permutation itself. The intercalation product aj/3 of 
two multiset permutations a and j3 is obtained by (a) expressing a and (I in the 
two-line notation, (b) juxtaposing these two-line representations, and (c) sorting 
the columns into nondecreasing order of the top line. The sorting is supposed 
to be stable, in the sense that left-to-right order of elements in the bottom line 
is preserved when the corresponding top line elements are equal. For example, 
c a d a b j bddad = cabddabdad , since 

fa a b c d\ f a b d d d\ _faaabbcddd d\ 
\cadab) J \bddad)~\cabddabdadJ 

It is easy to see that the intercalation product is associative: 

( a T /?) T 7 = a T (/? T 7); 
it also satisfies two cancellation laws: 

TTja = TTj/3 implies a = /?, 
a j 7r = fi x 7r implies a = /?. 

There is an identity element, 

aje = eja — a, 


( 6 ) 

(7) 

(8) 


5.1.2 


PERMUTATIONS OF A MULTISET 


25 


where e is the null permutation, the “arrangement” of the empty set. Although 
the commutative law is not valid in general (see exercise 2), we do have 

aj/3 = /3ja if a and (i have no letters in common. (9) 

In an analogous fashion we can extend the concept of cycles in permutations 
to cases where elements are repeated; we let 


(xi x 2 ... x n ) (10) 

stand for the permutation obtained in two-line form by sorting the columns of 

fx\ x 2 ... x n \ 

\ x 2 x 3 ... X\ ) 

by their top elements in a stable manner. For example, we have 

d b d d a c a a b d\ faaabbcdddd ' 


(11) 


( dbddacaabd) = 


bddacaabdd 


cabddabdad 


so the permutation (4) is actually a cycle. We might render this cycle in words 
by saying something like “d goes to b goes to d goes to d goes . . . goes to d 
goes back.” Note that these general cycles do not share all of the properties of 
ordinary cycles; (aq x 2 ... x n ) is not always the same as (x 2 . . ,x n x\). 

We observed in Section 1.3.3 that every permutation of a set has a unique 
representation (up to order) as a product of disjoint cycles, where the “product” 
of permutations is defined by a law of composition. It is easy to see that 
the product of disjoint cycles is exactly the same as their intercalation ; this 
suggests that we might be able to generalize the previous results, obtaining a 
unique representation (in some sense) for any permutation of a multiset, as the 
intercalation of cycles. In fact there are at least two natural ways to do this, 
each of which has important applications. 

Equation (5) shows one way to factor cabddabdad as the intercala- 
tion of shorter permutations; let us consider the general problem of finding all 
factorizations ir = a j (3 of a given permutation 7r. It will be helpful to consider 
a particular permutation, such as 

( a a b b b b b c c c d d d d d\ , . 

dbcbcacdaddbbbd)' ^ 12 ' 

as we investigate the factorization problem. 

If we can write this permutation it in the form a j /3, where a contains the 
letter a at least once, then the leftmost a in the top line of the two-line notation 
for a must appear over the letter d, so a must also contain at least one occurrence 
of the letter d. If we now look at the leftmost d in the top line of a, we see in 
the same way that it must appear over the letter d, so a must contain at least 
two d’ s. Looking at the second d, we see that a also contains at least one b. We 
have deduced the partial result 


( a b d d 

d d b 


(13) 


26 


SORTING 


5.1.2 


on the sole assumption that a is a left factor of 7 r containing the letter a. 
Proceeding in the same manner, we find that the b in the top line of (13) must 
appear over the letter c, etc. Eventually this process will reach the letter a again, 
and we can identify this a with the first a if we choose to do so. The argument 
we have just made essentially proves that any left factor a of (12) that contains 
the letter a has the form (d d b c d b b c a) j a', for some permutation a'. (It 
is convenient to write the a last in the cycle, instead of first; this arrangement 
is permissible since there is only one a.) Similarly, if we had assumed that a 
contains the letter b, we would have deduced that a = (c d d b)ja" for some a". 

In general, this argument shows that, if we have any factorization aj/3 = n, 
where a contains a given letter y, exactly one cycle of the form 

(xi ... x n y), n > 0, xi, . . . , x n ^ y, (14) 

is a left factor of a. This cycle is easily determined when 7r and y are given; it is 
the shortest left factor of 7r that contains the letter y. One of the consequences 
of this observation is the following theorem: 

Theorem A. Let the elements of the multiset M be linearly ordered by the 
relation < . Every permutation n of M has a unique representation as the 
intercalation 

TT = (xil...X lni yi)j(x 2 l...X2n 2 y2)T--r(x t l...X tnt y t ), t > 0, (15) 

where the following two conditions are satisfied: 

yi < y 2 < ■ ■ ■ < y t and y { < Xij for 1 < j < rq, 1 < i < t. (16) 

(In other words, the last element in each cycle is smaller than every other element, 
and the sequence of last elements is in nondecreasing order.) 

Proof. If 7T = e, we obtain such a factorization by letting t = 0. Otherwise 
we let yi be the smallest element permuted; and we determine (i n . . . x lni yi), 
the shortest left factor of ix containing 2/1 , as in the example above. Now 7r = 

fan • • ■ x ir»i 2/i) TP for some permutation p: by induction on the length, we can 
write 

P = (^21 • • • X 2 n 2 2/2) T • • • T (x t i . . . x tnt 2 /(), t > 1, 

where (16) is satisfied. This proves the existence of such a factorization. 

Conversely, to prove that the representation (15) satisfying (16) is unique, 
clearly t = 0 if and only if 7T is the null permutation e. When t > 0, (16) 
implies that 2/1 is the smallest element permuted, and that (x n . . . x lni y x ) is 
the shortest left factor containing 2/1 • Therefore (x u ... x lni y x ) is uniquely 
determined, by the cancellation law (7) and induction, the representation is 
unique. | 

For example, the canonical” factorization of (12), satisfying the given con- 
ditions, is 

(d d b c d b b c a)j(b a) T (c d b)j(d), (i 7 ) 


if a < b < c < d. 


5.1.2 


PERMUTATIONS OF A MULTISET 


27 


It is important to note that we can actually drop the parentheses and the 
t’s in this representation, without ambiguity! Each cycle ends just after the first 
appearance of the smallest remaining element. So this construction associates 
the permutation 

n' = ddbcdbbcabacdbd 
with the original permutation 

Tr = dbcbcacdaddbbbd. 

Whenever the two-line representation of 7r had a column of the form v x , where 
x < y, the associated permutation 7 r' has a corresponding pair of adjacent 
elements . . . y x . . . . Thus our example permutation 7r has three columns of the 
form f , and 7r' has three occurrences of the pair d b. In general this construction 
establishes the following remarkable theorem: 

Theorem B. Let M be a multiset. There is a one-to-one correspondence 
between the permutations of M such that, if n corresponds to 7r', the following 
conditions hold: 

a) The leftmost element of 7r' equals the leftmost element of 7r. 

b) For all pairs of permuted elements (x, y) with x < y, the number of occur- 

rences of the column % in the two-line notation of tt is equal to the number of 
times x is immediately preceded by y in 7 r'. | 

When M is a set, this is essentially the same as the “unusual correspondence” 
we discussed near the end of Section 1.3.3, with unimportant changes. The more 
general result in Theorem B is quite useful for enumerating special kinds of 
permutations, since we can often solve a problem based on a two-line constraint 
more easily than the equivalent problem based on an adjacent-pair constraint. 

P. A. MacMahon considered problems of this type in his extraordinary 
book Combinatory Analysis 1 (Cambridge Univ. Press, 1915), 168-186. He 
gave a constructive proof of Theorem B in the special case that M contains 
only two different kinds of elements, say a and b ; his construction for this 
case is essentially the same as that given here, although he expressed it quite 
differently. For the case of three different elements a, b, c, MacMahon gave 
a complicated nonconstructive proof of Theorem B; the general case was first 
proved constructively by Foata [Comptes Rendus Acad. Sci. 258 (Paris, 1964), 
1672-1675]. 

As a nontrivial example of Theorem B, let us find the number of strings of 
letters a, b, c containing exactly 

A occurrences of the letter a; 

B occurrences of the letter b; 

C occurrences of the letter c; 
k occurrences of the adjacent pair of letters ca; 
l occurrences of the adjacent pair of letters cb ; 
m occurrences of the adjacent pair of letters ba. (18) 


28 


SORTING 


5.1.2 


The theorem tells us that this is the same as the number of two-line arrays of 
the form 

A B r. 


a ... a b 


A—k—m a’s m a’s 


B—l 6’s 


b c 


The a’s can be placed in the second line in 


A 

A-k- 


then the 6’s can be placed in the remaining positions ir 


B + k\ / C-k 
B — l ) { l 


The positions that are still vacant must be filled by c’s; hence the desired number 


A — k — m ) \ m 


B + k\ / C-k 
B — l ) [ l 


Let us return to the question of finding all factorizations of a given per- 
mutation. Is there such a thing as a “prime” permutation, one that has no 
intercalation factors except itself and e? The discussion preceding Theorem A 
leads us quickly to conclude that a permutation is prime if and only if it is a 
cycle with no repeated elements. For if it is such a cycle, our argument proves 
that there are no left factors except e and the cycle itself. And if a permutation 
contains a repeated element y, it has a nontrivial cyclic left factor in which y 
appears only once. 

A nonprime permutation can be factored into smaller and smaller pieces 
until it has been expressed as a product of primes. Furthermore we can show 
that the factorization is unique, if we neglect the order of factors that commute: 

Theorem C. Every permutation of a multiset can be written as a product 

T<72 t • " T<r t , t > 0 , ( 21 ) 

where each (jj is a cycle having no repeated elements. This representation is 
unique, in the sense that any two such representations of the same permuta- 
tion may be transformed into each other by successively interchanging pairs of 
adjacent disjoint cycles. 


5.1.2 


PERMUTATIONS OF A MULTISET 


29 


The term “disjoint cycles” means cycles having no elements in common. As 
an example of this theorem, we can verify that the permutation 

{ a a b b c c d\ 

\b a a c d b c J 

has exactly five factorizations into primes, namely 

(a b ) T (a) t (c d) T ( b c) = (a b) T (c d) j (a) j ( b c ) 

= (ab) T (c d) r (6 c) T (a) 

= (cd) T (a 6) t (6 c) T (a) 

= M)t (a 6) T (a) t (b c). (22) 

Proof. We must show that the stated uniqueness property holds. By induction 
on the length of the permutation, it suffices to prove that if p and a are unequal 
cycles having no repeated elements, and if 

pja = aj/3, 

then p and cr are disjoint, and 

a = <7 T 0, (3 = pj9, 

for some permutation 6. 

If y is any element of the cycle p, then any left factor of a j (3 containing the 
element y must have p as a left factor. So if p and <7 have an element in common, 
cr is a multiple of p\ hence a = p (since they are primes), contradicting our as- 
sumption. Therefore the cycle containing y, having no elements in common with 
cr, must be a left factor of ft. The proof is completed by using the cancellation 
law (7). | 

As an example of Theorem C, let us consider permutations of the multiset 
M = {A ■ a, B ■ b, C ■ c} consisting of A a’s, B b' s, and C c’s. Let N(A, B , C, m) 
be the number of permutations of M whose two-line representation contains no 
columns of the forms “ , c c , and exactly m columns of the form % . It follows 
that there are exactly A — m columns of the form “, B — m of the form l , 
C - B + m of the form c a , C — A + m of the form b c , and A + B — C - m of the 
form b . Hence 

J v(w.»)-(^)( c _f + m )( B ® m ). (=3) 

Theorem C tells us that we can count these permutations in another way: 
Since columns of the form “ , 5 , c c are excluded, the only possible prime factors 
of the permutation are 

(a b), (a c ), (b c), ( a b c), (a c b). (24) 

Each pair of these cycles has at least one letter in common, so the factorization 
into primes is completely unique. If the cycle (a b c) occurs k times in the 
factorization, our previous assumptions imply that (a b) occurs m — k times, 



30 


SORTING 


5.1.2 


C b c) occurs C - A + m - k times, (a c) occurs C — B + m - k times, and 
(a c b ) occurs A + B — C — 2m + k times. Hence N(A, B , C, m) is the number 
of permutations of these cycles (a multinomial coefficient), summed over k: 

N(A,B,C, m) 

_ ■> (C + m — k)\ 

fe ( m-k)\(C-A + m-k)\(C-B + m-k)\k\(A + B-C~2m + k)\ 


?(!)(£)( 


A — m 

C — B + m — k 


) rr 


(25) 


Comparing this with (23), we find that the following identity must be valid: 

?C) (cAll. k ) ( C+ T k ) - ( c- b a+ J (b- J- (-) 


This turns out to be the identity we met in exercise 1.2.6-31, namely 
+ / N + R - S\ / f? + j \ _ \ / 5 \ 

y' 3 A IV-j Am + _ \m) \n)’ 


(27) 


with M = A + B-C-m, N = C-B+m, R = B,S = C, and j = C-B+m-k. 

Similarly we can count the number of permutations of {A- a, B b, C c, D d) 
such that the number of columns of various types is specified as follows: 

Column a a b b c c d d 

type: d b a c b da 0 

Frequency: r A-r q B-q B-A + r D-r A-q D-A + q 

(Here A + C = B + D.) The possible cycles occurring in a prime factorization 
of such permutations are then 

Cycle: (a b) ( b c ) (c d) (da) (abed) (d c b a) 

Frequency: A-r-s B-q-s D-r-s A-q-s s q-A + r + s ' 

for some s (see exercise 12). In this case the cycles (a b) and (c d) commute with 
each other, and so do (b c) and (d a), so we must count the number of distinct 
prime factorizations. It turns out (see exercise 10) that there is always a unique 
factorization such that no (c d) is immediately followed by (a b), and no (d a) is 
immediately followed by (b c). Hence by the result of exercise 13, we have 

E Z-Bw A — q — s \ / B + D — r — s — t\ 

\ t J \A — r — s — t) V B-q-s J 

D\ 

(D — r — s)! (A-g-s)! s! (g-A + r + s)! 


5.1.2 


PERMUTATIONS OF A MULTISET 


31 


Taking out the factor from both sides and simplifying the factorials slightly 
leaves us with the complicated-looking five-parameter identity 
f B\ f A — r — t\ f B + D — r — s — t\ / D — A + q\ f A — q \ 

~\t/\ s /\ D + q — r — t ) \ D — r — s ) \r + t — q) 

-( A XT/)0 <*» 

The sum on s can be performed using ( 27 ), and the resulting sum on t is easily 
evaluated; so, after all this work, we were not fortunate enough to discover any 
identities that we didn’t already know how to derive. But at least we have 
learned how to count certain kinds of permutations, in two different ways, and 
these counting techniques are good training for the problems that lie ahead. 

EXERCISES 

1. [M05] True or false: Let Mi and M 2 be multisets. If a is a permutation of Mi 
and /3 is a permutation of M 2 , then a j /3 is a permutation of Mi U M 2 . 

2. [10] The intercalation of c a d a b and b d d a d is computed in ( 5 ); find the 

intercalation b d d a d j c a d a b that is obtained when the factors are interchanged. 

3. [MIS] Is the converse of ( 9 ) valid? In other words, if a and B commute under 
intercalation, must they have no letters in common? 

4. [Mil] The canonical factorization of ( 12 ), in the sense of Theorem A, is given 
in ( 17 ) when a < b < c < d. Find the corresponding canonical factorization when 
d < c < b < a. 

5. [M23] Condition (b) of Theorem B requires x < y, what would happen if we 
weakened the relation to x < y? 

6 . [Ml 5] How many strings are there that contain exactly m o’ s, n b’s, and no other 
letters, with exactly k of the a’s preceded immediately by a 6 ? 

7. [M21] How many strings on the letters a, b, c satisfying conditions ( 18 ) begin 
with the letter a? with the letter 6 ? with c? 

► 8 . [20] Find all factorizations of ( 12 ) into two factors aj/3. 

9. [S3] Write computer programs that perform the factorizations of a given multiset 
permutation into the forms mentioned in Theorems A and C. 

► 10. [M30] True or false: Although the factorization into primes isn’t quite unique, 
according to Theorem C, we can ensure uniqueness in the following way: “There is a 
linear ordering A of the set of primes such that every permutation of a multiset has a 
unique factorization <7 it<7 2 t • ■ • T&n into primes subject to the condition that a t ■< cr i+1 
whenever < 7 , commutes with cq+i , for 1 < i < n.” 

► 11. [M26] Let cti , ( 72 , . .. ,<r t be cycles without repeated elements. Define a partial or- 
dering -< on the t objects {xi ,. . ■ ,x t } by saying that Xi -< Xj if i < j and a, has at least 
one letter in common with crj . Prove the following connection between Theorem C and 
the notion of “topological sorting” (Section 2.2.3): The number of distinct prime factor- 
izations of CT 1 JO 2 T • --[crt is the number of ways to sort the given partial ordering topo- 
logically. (For example, corresponding to ( 22 ) we find that there are five ways to sort the 
ordering x\ A X 2 , X 3 -< X 4 , xi -< X 4 topologically.) Conversely, given any partial order- 
ing on t elements, there is a set of cycles {<7i, 02 , ■ ■ ■ , <7t} that defines it in the stated way. 


32 


SORTING 


5.1.2 


12 . [ M16 ] Show that (29) is a consequence of the assumptions of (28). 

13 . [M21 ] Prove that the number of permutations of the multiset 

{A- a, B b, C c, D ■ d, E ■ e, F • /} 
containing no occurrences of the adjacent pairs of letters ca and db is 

Y' { D \ ( A + B + E + F \ ( A + B + C+E + F~t\ (C + D + E + F\ 
t \A — t) \ t A B A C,D,E,F )■ 

14 . [M30] One way to define the inverse n~ of a general permutation 7r, suggested by 
other definitions in this section, is to interchange the lines of the two-line representation 
of 7r and then to do a stable sort of the columns in order to bring the top row into 
nondecreasing order. For example, if a < b < c < d, this definition implies that the 
inverse of cabddabdad is acdadabbdd. 

Explore properties of this inversion operation; for example, does it have any simple 
relation with intercalation products? Can we count the number of permutations such 
that 7 r = 7r~ ? 

► 15 . [M25] Prove that the permutation ai ...a n of the multiset 

{nr • 2Jl, Tl2 X 2 , • . . , Tim * Em } , 

where X\ < *2 < • • • < x m and ni + n 2 + • • • + n m = n, is a cycle if and only if the 
directed graph with vertices {*i,*2,...,®m} and arcs from Xj to a ni+ ... +nj contains 
precisely one oriented cycle. In the latter case, the number of ways to represent the 
permutation in cycle form is the length of the oriented cycle. For example, the directed 
graph corresponding to 


faaabbcccdd 
\d cbacaabdc 



and the two ways to represent the permutation as a cycle ar e(baddcacabc) and 
(c a d d c a c b a b). 

16 . [M55] We found the generating function for inversions of permutations in the 
previous section, Eq. 5.1.1-(8), in the special case that a set was being permuted. 
Show that, in general, if a multiset is permuted, the generating function for inversions 
of {ni • ®i, r*2 • *a» . . ■ } is the “2-multinomial coefficient” 


( " 
\ni,n 2 ,. 



n\ z 

ni'.z n 2 \z-.-' 


771 

where m\ z = (1 + 2 -I b z k ~ 1 ). 

k = 1 


[Compare with (3) and with the definition of 2-nomial coefficients in Eq. 1.2.6-(4o).] 

17 . [M24] Find the average and standard deviation of the number of inversions in 
a random permutation of a given multiset, using the generating function found in 
exercise 16. 


18. [M30] (P. A. MacMahon.) The index of a permutation <21 a 2 . . . a n was defined 
in the previous section; and we proved that the number of permutations of a given 
set that have a given index k is the same as the number of permutations that have k 
inversions. Does the same result hold for permutations of a given multiset? 


5.1.2 


PERMUTATIONS OF A MULTISET 


33 


19 . [ HMZ8 ] Define the Mobius function p(n) of a permutation n to be 0 if 7r contains 
repeated elements, otherwise (-l) fc if n is the product of k primes. (Compare with the 
definition of the ordinary Mobius function, exercise 4.5.2-10.) 

a) Prove that if 7r / e, we have 

= 0 , 

summed over all permutations A that are left factors of n (namely all A such that 
7r = A j p for some p ). 

b) Given that x\ < X 2 < ■ • • < x m and 7r = xy Xy . . . Xj„, where 1 < ik < m for 
1 < k < n, prove that 


p(n) = (~l) n e(iii 2 . . .i n ), where e(ii i 2 . . . i n ) = sign (i k - ij). 

l<j<fe<n 

► 20 . [HM33] (D. Foata.) Let ( ay ) be any matrix of real numbers. In the notation of 
exercise 19(b), define v(n) — a< ui . . . a , n]n , where the two-line notation for 7r is 



This function is useful in the computation of generating functions for permutations of 
a multiset, because v(ir), summed over all permutations 7r of the multiset 

{ni • x \ , . . . , n m ' Xm~\ , 

will be the generating function for the number of permutations satisfying certain 
restrictions. For example, if we take oy = z for i = j, and ay = 1 for i / j, 
then "M is the generating function for the number of “fixed points” (columns in 
which the top and bottom entries are equal). In order to study J2 v(n) for all multisets 
simultaneously, we consider the function 


g= Y1 7rv ( 7r ) 

summed over all n in the set {aq, . . . , x m }* of all permutations of multisets involving 
the elements xi,...,x m , and we look at the coefficient of s" 1 . . . ij," in G. 

In this formula for G we are treating 7r as the product of the x’s. For example, 
when m = 2 we have 

G = l + Xll'(xi)+X2l/(x2) + XlXll'(xiXl)+XlX2l'(xiX2)+X2Xll'(x2Xl)+X2X2l'(X2X2)-\ 

= 1 + Xian -\-X2Cl22 +Xjaji +XlX2ana22 +XlX2a2iai2 + *2 a 22 + • • • • 

Thus the coefficient of x" 1 . . . x^ r ' in G is summed over all permutations n of 

{ni • xi , . . . , n m ■ Xm}. It is not hard to see that this coefficient is also the coefficient of 
x" 1 . . . xj," in the expression 


(anXi T * * ‘ T OlmXm) (a 2lXl T ■ * ■ T 02mXm) * * * (amlXl “h * * * T OjmmXm) * 


The purpose of this exercise is to prove what P. A. MacMahon called a “Master 
Theorem” in his Combinatory Analysis 1 (1915), Section 3, namely the formula 


G = 1/D, 


where 


D = det 


/ 1 — anxi 

I —021X1 


— 012X2 
1 — a22X2 


Ol mXm \ 
m®m 

1 Q'mm%m ' 




— a m 2^2 


34 


SORTING 


5.1.2 


For example, if a l3 = 1 for all i and j, this formula gives 


G=l/(l-(ii+i 2 i h x m )), 


and the coefficient of x" 1 . . . x^™ turns out to be (ni + • • • + n m )!/m! . . . n m !, as it 
should. To prove the Master Theorem, show that 

a) u(tt t p) = v(n) i/(p); 

b) D = Xv 7r/i(7r)^(7r), in the notation of exercise 19, summed over all permutations 
7r in {xi,...,x m }*; 

c) therefore D ■ G = 1. 

21. [M21] Given m, . . . , n m , and d > 0, how many permutations o,\ a 2 . . . a n of the 

multiset {m • 1, . . . , n m • m} satisfy a j+1 > aj - d for 1 < j < n = m 4 h n m ? 

22. [M30] Let P(x" 1 . . . iJJ" ) denote the set of all possible permutations of the multi- 
set {m n m -x m }, and let P^x^x” 1 ...x^ m ) be the subset of P^x" 1 ...x"">) 

in which the first no elements are / xo. 

a) Given a number t with 1 < t < m, find a one-to-one correspondence between 
P(l ni . . . rn"' m ) and the set of all ordered pairs of permutations that belong re- 
spectively to P o (0 fc l ni . . . f"‘) and P o (0 fc (t+l) nt+1 ... m nm ), for some k > 0. [Hint: 
For each jt = ai . . . o n G P(l ni . . . m n ’ rl ), let Z(7r) be the permutation obtained by 

replacing t + 1, . . . , m by 0 and erasing all Os in the last n t+ 1 H f- n rn positions; 

similarly, let r( n) be the permutation obtained by replacing 1, . . . , t by 0 and 
erasing all Os in the first n\ + 1 - n t positions.] 

b) Prove that the number of permutations of P 0 (0 n ° l ni . . . m nm ) whose two-line form 
has pj columns ° and qj columns J 0 is 

]P(xf .. -x^y^ ...ylr-*") 1 [PK 1 . ..x*ry?- n . 

|Pq (0 n ° l n i ... m nrn ) | 

c) Let w i, . . . , w m , zi, • ■ • , z m be complex numbers on the unit circle. Define the 
weight w( 7r) of a permutation n € P(l ni . . . m rlm ) as the product of the weights 
of its columns in two-line form, where the weight of { is Wj /w k if j and k are 
both < t or both > t, otherwise it is Zj/z k . Prove that the sum of w(n) over all 
7T € P(l ni . . . m nm ) is 


E 


kl 2 (n< t - k)l (n> t - A;)! 


nil . . . n„ 




where n< t is m + • • • + n«, n >t is n t+1 -f • • • + n m , and the inner sum is over all 
(pi, • • • ,Pm) such that p< t = p >t = k. 

23. [M23\ A strand of DNA can be thought of as a word on a four-letter alphabet. 
Suppose we copy a strand of DNA and break it completely into one-letter bases, then 
recombine those bases at random. If the resulting strand is placed next to the original, 
prove that the number of places in which they differ is more likely to be even than odd. 
[Hint: Apply the previous exercise.] 

24. [27] Consider any relation R that might hold between two unordered pairs of 
letters; if {w,x}R{y,z} we say {w,x} preserves {y,z}, otherwise {w,x} moves {y,z}. 

The operation of transposing ” * with respect to R replaces f by £ “ or f , 
according as the pair {w,x} preserves or moves the pair {y, z}, assuming that w ^ x 
and y ^ z; if w = x or y = z the transposition always produces * ™ . 


5.1.3 


RUNS 


35 


The operation of sorting a two-line array (*{ ;;; *" ) with respect to R repeatedly 
finds the largest Xj such that x 3 > x J+l and transposes columns j and j + 1, until 
eventually x\ < ■ ■ ■ < x n . (We do not require 2/1 . . . y n to be a permutation of x% . . . x n .) 

a) Given ;;; *"), prove that for every x € {xi, . . . ,x n } there is a unique y 6 

(Vu- • ■ ,Vn } such that sort(*{ ;;; *” ) = sort(**? ;;; *?) for some x' 2 , y' 2 , ■ . . ,x' n , y' n . 

b) Let ;;; ”£) ® (*J ;;; *') denote the result of sorting (“* ;;; ”£ x z \ ;;; *' ) with 

respect to R. For example, if R is always true, ® sorts {uq , . . . , Wk , ii , . . . , xi }, 

but it simply juxtaposes yi . . .yic with Z\ . . . zp if R is always false, ® is the inter- 
calation product j. Generalize Theorem A by proving that every permutation it 
of a multiset M has a unique representation of the form 

7T - (in • • • X ini 2/1 ) ® ((x 2 l . . . X 2 n 2 2/2) ® ( ■ • • ® (*tl ■ ■ • Xtn t Vt) ' ' ' )) 

satisfying (16), if we redefine cycle notation by letting the two-line array (11) 
correspond to the cycle (x 2 ... x n xi) instead of to (aq x 2 ... x n ). For example, 
suppose {w, x}R{y, z} means that w, x, y, and z are distinct; then it turns out 
that the factorization of (12) analogous to (17) is 

( ddbca ) ® (( cbba ) ® (( cdb ) ® ((db) ® (d)))) . 

(The operation ® does not always obey the associative law; parentheses in the 
generalized factorization should be nested from right to left.) 

*5.1.3. Runs 

In Chapter 3 we analyzed the lengths of upward runs in permutations, as a way 
to test the randomness of a sequence. If we place a vertical line at both ends 
of a permutation ai a 2 . ■ . a n and also between cij and a J+1 whenever a,j > a J+i , 
the runs are the segments between pairs of lines. For example, the permutation 

I 3 5 7 | 1 6 8 9 | 4 | 2 | 

has four runs. The theory developed in Section 3.3.2G determines the average 
number of runs of length k in a random permutation of {1,2 ,..., n}, as well as 
the covariance of the numbers of runs of lengths j and k. Runs are important in 
the study of sorting algorithms, because they represent sorted segments of the 
data, so we will now take up the subject of runs once again. 

Let us use the notation 

O M 

to stand for the number of permutations of {1,2, ...,n} that have exactly k 
“descents” a 3 > aj + 1, thus exactly k + 1 ascending runs. These numbers ({“) 
arise in several contexts, and they are usually called Eulerian numbers since 
Euler discussed them in his famous book Institutiones Calculi Differentialis 
(St. Petersburg: 1755), 485-487, after having introduced them several years 
earlier in a technical paper [Comment. Acad. Sci. Imp. Petrop. 8 (1736), 147- 
158, §13]; they should not be confused with the Euler numbers E n discussed in 
exercise 5.1.4-23. The angle brackets in (£) remind us of the “>” sign in the 
definition of a descent. Of course (£) is also the number of permutations that 
have k “ascents” a 3 < a J+1 . 


36 


SORTING 


5.1.3 


We can use any given permutation of {1 , . . . , n - 1} to form n new permuta- 
tions, by inserting the element n in all possible places. If the original permutation 
has k descents, exactly A:+ 1 of these new permutations will have k descents; the 
remaining n — 1 — k will have k + 1, since we increase the number of descents 
unless we place the element n at the end of an existing run. For example, the 
six permutations formed from 312 4 5 are 

631245, 361245, 316245, 

312645, 312465, 312456; 

all but the second and last of these have two descents instead of one. Therefore 
we have the recurrence relation 

( A; ) ~ ^ + k ) + i)> integer n > 0, integer k. (2) 

By convention we set 



saying that the null permutation has no descents. The reader may find it 
interesting to compare (2) with the recurrence relations for Stirling numbers 
in Eqs. 1.2.6-(46). Table 1 lists the Eulerian numbers for small n. 

Several patterns can be observed in Table 1. By definition, we have 



Eq. (6) follows from (5) because of a general rule of symmetry, 

O^L-l-k)' for " al ’ M 

which comes from the fact that each nonnull permutation ai a 2 . . . a n having 
k descents has n — 1 — k ascents. 

Another important property of the Eulerian numbers is the formula 

E /n\ f m + k\ 

\k)\ n )=“ ’ ni °- <*> 

k 

which was discovered by the Chinese mathematician Li Shan-Lan and pub- 
lished in 1867. [See J.-C. Martzloff, A History of Chinese Mathematics (Berlin: 
Springer, 1997), 346-348; special cases for n < 5 had already been known to 
Yoshisuke Matsunaga in Japan, who died in 1744.] Li Shan-Lan’s identity follows 
from the properties of sorting: Consider the m” sequences a x 0,2 . . . a n such that 
1 < di <m. We can sort any such sequence into nondecreasing order in a stable 
manner, obtaining 


^ a i2 — ' ' ‘ ^ a i n 


(9) 


5 . 1.3 


Table 1 

EULERIAN NUMBERS 


RUNS 


37 


n 

(o> 

0 

C> 

(a) 

<:> 

/n\ / n \ / n \ 

\ 5 / \ 6 / \ 7 / 

/ n \ 

\ 8 / 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

1 

1 

0 

0 

0 

0 

0 

0 

0 

0 

2 

1 

1 

0 

0 

0 

0 

0 

0 

0 

3 

1 

4 

1 

0 

0 

0 

0 

0 

0 

4 

1 

11 

11 

1 

0 

0 

0 

0 

0 

5 

1 

26 

66 

26 

1 

0 

0 

0 

0 

6 

1 

57 

302 

302 

57 

1 

0 

0 

0 

7 

1 

120 

1191 

2416 

1191 

120 

1 

0 

0 

8 

1 

247 

4293 

15619 

15619 

4293 

247 

1 

0 

9 

1 

502 

14608 

88234 

156190 

88234 

14608 

502 

1 


where ii i 2 . ■ . i n is a uniquely determined permutation of {1,2,..., n} such that 
a-ij = aq +1 implies ij < i J+1 ; in other words, ij > ij +x implies that aq. < a ij+1 . 
If the permutation *i i 2 ■ • • i n has k runs, we will show that the number of 
corresponding sequences aia 2 ...a n is ( m+ ”“ fc ) . This will prove (8) if we replace 
k by n — k and use (7), because (£) permutations have n — k runs. 

For example, if n = 9 and i x i 2 . . . i n = 35716894 2 , we want to count the 
number of sequences a x a 2 ■ . . a n such that 


1 < 0-3 < as < «7 < ai < a 6 < as < a g < 04 < a 2 < m; (10) 

this is the number of sequences 61 b 2 ... 69 such that 


1 < 61 < b 2 < b 3 < 64 < b 5 < b 6 < br < b s < 69 < m + 5 , 


since we can let 6j = a 3 , b 2 = a 5 + 1, 63 = a 7 + 2, 64 = ai + 2, b 5 — a 6 + 3, 
etc. The number of choices of the b' s is simply the number of ways of choosing 
9 things out of m + 5 , namely ( m ^ 5 ) ; a similar proof works for general n and k, 
and for any permutation i\ i 2 . . . i n with k runs. 

Since both sides of (8) are polynomials in m, we may replace m by any real 
number x, and we obtain an interesting representation of powers in terms of 
consecutive binomial coefficients: 


xwx: 1 ) 


+ ••• + 


l)( I + n ')' "- 1 ' (1,) 


For example, 

‘ 3 =GMTMT)' 

This is the key property of Eulerian numbers that makes them useful in the 
study of discrete mathematics. 

Setting x = 1 in (11) proves again that („" 1 ) = 1 , since the binomial 
coefficients vanish in all but the last term. Setting x — 2 yields 


n 

n — 2 


2 n - n - 1, 


n > 1. 


(12) 



38 


SORTING 


5.1.3 


Setting x — 3, 4, ... shows that relation (n) completely defines the numbers 
(£)? and leads to a formula originally given by Euler: 

(")=(* + i)»-*-(”+ i ) + (*- ir ("+ i )-... +( - 1) ‘ 1 .(»+ i ) 

k 

= )(k + l-j) n , n > 0,' k > 0. ( 13 ) 

j = 0 J 7 

Now let us study the generating function for runs. If we set 

= (.4) 

k 

the coefficient of z k is the probability that a random permutation of { 1 , 2 ,..., n) 
has exactly k runs. Since k runs are just as likely as n + 1 — k, the average number 
of runs must be |(n + l), hence g'Jl) = |(n + l). Exercise 2 (b) shows that there 
is a simple formula for all the derivatives of g„(z) at the point z = 1: 

(. 5 ) 

Thus in particular the variance g"(l) + g'Jl) - g' n (l) 2 comes to (n + 1)/12, for 
n > 2, indicating a rather stable distribution about the mean. (We found t his 
same quantity in Eq. 3.3.2-(i8), where it was called covar(f?i, i?').) Since g n (z) 
is a polynomial, we can use formula ( 15 ) to deduce the Taylor series expansions 

fcW “ a £<* - {It J } - a t ^ - *)-*« C: J }■ 

Ac— U k = 0 

(i6) 

The second of these equations follows from the first, since 

9n(z) = z n+1 g n (l/z), n> 1, ( 17 ) 

by the symmetry condition ( 7 ). The Stirling number recurrence 




gives two slightly simpler representations, 


= h £ ^ = { l }. (>s) 


when n > 1. The super generating function 
^ _ V' 9n{z)x n _ 


9 { z, x) = y. 9A ¥ 1 = e q 




I 


5.1.3 


RUNS 


39 


is therefore equal to 

V ((z-l)x) n {n\k\ / e (*-D*-l \ fc ( 1 - 2 ) 

2—* (z — l) k l fc J n! 2—i l z — 1 ) e ( z_1 ) x — z' 

k,n> 0 v ' fc>0 v ' 

this is another relation discussed by Euler. 

Further properties of the Eulerian numbers may be found in a survey pa- 
per by L. Carlitz [Math. Magazine 32 (1959), 247-260]. See also J. Riordan, 
Introduction to Combinatorial Analysis (New York: Wiley, 1958), 38-39, 214- 
219, 234-237; D. Foata and M. P. Schiitzenberger, Lecture Notes in Math. 138 
(Berlin: Springer, 1970). 

Let us now consider the length of runs; how long will a run be, on the 
average? We have already studied the expected number of runs having a given 
length, in Section 3.3.2; the average run length is approximately 2, in agreement 
with the fact that about | (n + 1 ) runs appear in a random permutation of 
length n. For applications to sorting algorithms, a slightly different viewpoint is 
useful; we will consider the length of the fcth run of the permutation from left to 
right, for k = 1 , 2 , . . . . 

For example, how long is the first (leftmost) run of a random permutation 
ax a 2 . . . a n ? Its length is always > 1, and its length is > 2 exactly one-half 
the time (namely when oi < 02 ). Its length is > 3 exactly one-sixth of the 
time (when a\ < a 2 < 03 ), and, in general, its length is > m with probability 
<lm — 1 /ml, for 1 < m < n. The probability that its length is exactly equal to m 
is therefore 

Pm — Qm q m + 1 = 1/m! - l/(m + 1)!, for 1 < m < n; 

p n = 1 /n!. ( 21 ) 

The average length of the first run therefore equals 

Pi+2p 2 -\ Vnp n - (<?i - 92 ) + 2(92-93)4 h(n- l)(g„_i -q n ) + nq n 

11 1 

- Ql +92 + ' ' '+9n - + - + •••+— . (22) 

If we let n — > 00 , the limit is e — 1 = 1.71828 . . . , and for finite n the value is 
e — 1 — 5 n where S n is quite small; 

1 1 \ e - 1 

n + 2 (n + 2 )(n + 3) / — (n + 1)! 

For practical purposes it is therefore convenient to study runs in a random infinite 
sequence of distinct numbers 




{n + 1 )! 


01,02,03,...; 

by “random” we mean in this case that each of the n! possible relative orderings 
of the first n elements in the sequence is equally likely. The average length of 
the first run in a random infinite sequence is exactly e — 1 . 

By slightly sharpening our analysis of the first run, we can ascertain the 
average length of the fcth run in a random sequence. Let qk m be the probability 


40 


SORTING 


5.1.3 


that the first k runs have total length > m; then q km is 1/m! times the number 
of permutations of {l,2,...,m} that have < k runs, 


Qkm — 



( 2 3) 


The probability that the first k runs have total length m is q km - q k (m+i)- 
Therefore if L k denotes the average length of the kth run, we find that 


L\ H + L k — average total length of first k runs 

= (<Zfci ~ Qk2) + 2(^2 — Qkz) + 3(gfe3 — qki) + ■ • • 

= Qkl + 1k2 + <7fc3 + • • • ■ 


Subtracting LH (-Tfc-i and using the value of q krn in (23) yields the desired 

formula 

Lk = h(k-i) + h(k-i) + h(k-i) + '" = S(fc- i)ii- (24) 

m> 1 

Since = 0 except when k — 1, L k turns out to be the coefficient of z k ~ 1 in 
the generating function g(z, 1) - 1 (see Eq. (19)), so we have 

L(Z) = ( 2 5) 

fc> 0 e 2 

From Euler’s formula (13) we obtain a representation of L*, as a polynomial in e: 


m> 0 j=0 J 

J=0 m> 0 7 — 0 m> 0 J 


_* (-l)k-ijk-i jn ^ (-l)fc— ijk- j - 1 ^ 

^ ( k ~j ) ! ^ n ' —'o ( k ~j- 1 )! ^ ri\ 


l=o 




(-1 r~‘i 


n> 0 

k-j-jk-j- 1 


j=0 




(26) 


This formula for L k was first obtained by B. J. Gassner [see CACM 10 (1967), 
89-93], In particular, we have 

Za = e- 1 ss 1.71828... ; 

L 2 = e 2 - 2e « 1.95249 . . . ; 

L z = e 3 - 3e 2 + |e « 1.99579 .... 

The second run is expected to be longer than the first, and the third run will 
be longer yet, on the average. This may seem surprising at first glance, but a 
moment’s reflection shows that the first element of the second run tends to be 


5.1.3 


RUNS 


41 


Table 2 

AVERAGE LENGTH OF THE fcTH RUN 


k 

L k 

k 

L k 

1 

1.71828 18284 59045+ 

10 

2.00000 00012 05997+ 

2 

1.95249 24420 12560- 

11 

2.00000 00001 93672+ 

3 

1.99579 13690 84285- 

12 

1.99999 99999 99909+ 

4 

2.00003 88504 76806- 

13 

1.99999 99999 97022- 

5 

2.00005 75785 89716+ 

14 

1.99999 99999 99719+ 

6 

2.00000 50727 55710- 

15 

2.00000 00000 00019+ 

7 

1.99999 96401 44022+ 

16 

2.00000 00000 00006+ 

8 

1.99999 98889 04744+ 

17 

2.00000 00000 00000+ 

9 

1.99999 99948 43434- 

18 

2.00000 00000 00000- 


small (it caused the first run to terminate); hence there is a better chance for 
the second run to go on longer. The first element of the third run will tend to 
be even smaller than that of the second. 

The numbers Lk are important in the theory of replacement-selection sorting 
(Section 5.4.1), so it is interesting to study their values in detail. Table 2 shows 
the first 18 values of Lk to 15 decimal places. Our discussion in the preceding 
paragraph might lead us to suspect at first that Lk+i > Lk , but in fact the values 
oscillate back and forth. Notice that Lk rapidly approaches the limiting value 2; 
it is quite remarkable to see these monic polynomials in the transcendental 
number e converging to the rational number 2 so quickly! The polynomials (26) 
are also somewhat interesting from the standpoint of numerical analysis, since 
they provide an excellent example of the loss of significant figures when nearly 
equal numbers are subtracted; using 19-digit floating point arithmetic, Gassner 
concluded incorrectly that L 12 > 2, and John W. Wrench, Jr., has remarked that 
42-digit floating point arithmetic gives jL 2 8 correct to only 29 significant digits. 

The asymptotic behavior of Lk can be determined by using simple principles 
of complex variable theory. The denominator of (25) is zero only when e 2_1 = z, 
namely when 

e x_1 cos y = x and e x_1 sin y = y, (27) 

if we write z = x + iy. Figure 3 shows the superimposed graphs of these two 
equations, and we note that they intersect at the points z = Zo, Z\, Z\, z 2 , z 2 , . . . , 
where Zq = 1, 


Z\ = (3.08884 30156 13044-) + (7.46148 92856 54255-) i, (28) 
and the imaginary part 3(zfc+i) is roughly equal to Ss(zfc) + 27T for large k. Since 

1 - z 


lim ( Z ") (z - z fe ) = -1, for k > 0, 
\e z_1 — z) 


and since the limit is —2 for k — 0, the function 

_ , , r , . 2Z Z Z Z Z 

Hfn( z ) — L(z) + — I 1 — r 1 b ■ 


Z — Zo z — Zi z — Zi z — z 2 z — z 2 


- + •••+- 


■ + • 


z — z, 


42 


SORTING 


5.1.3 


has no singularities in the complex plane for |z| < \z m+1 \. Hence R rn {z) has a 
power series expansion p k z k that converges absolutely when \z\ < |z m+1 |; it 
follows that p k M k ->■ 0 as k oo, where M - |z m+1 | - e. The coefficients of 
L(z) are the coefficients of 


2z z/z± 
l — z 1 — z/zi 
namely, 


f/fl , , z/z m , 

1 - z/z i 1 - z /z m ' 


Z / Z-m 
1 - z/z m 


+ Rm(z), 


L n = 2 + 2r x n cos n9 x + 2r 2 n cos n0 2 + • • • + 2r~" cos n0 m + 0(r~ n +1 ), (29) 

if we let 

z k = r k e i(>k . (go) 

This shows the asymptotic behavior of L n . We have 


tt = 8.07556 64528 89526-, 
r 2 = 14.35456 68997 62106-, 
r 3 = 20.62073 15381 80628-, 
r 4 = 26.88795 29424 54546-, 


9 1 = 1.17830 39784 74668+; 

0 2 = 1.31268 53883 87636+; 

0 3 = 1.37427 90757 91688-; 

0 4 = 1.41049 72786 51865-; (31) 


so the main contribution to L n — 2 is due to 7T and 9±, and convergence of 
(29) is quite rapid. Further analysis [W. W. Hooker, CACM 12 (1969), dll- 
413] shows that R m (z) cz for some constant c as m -loo; hence the series 
cos n0 k actually converges to L n when n > 1. (See also exercise 28.) 

A more careful examination of probabilities can be carried out to determine 
the complete probability distribution for the length of the fcth run and for the 

total length of the first k runs (see exercises 9, 10, 11). The sum L x -\ L k 

turns out to be asymptotically 2k - | + 0( 8~ k ). 


Let us conclude this section by considering the properties of runs when equal 
elements are allowed to appear in the permutations. The famous nineteenth- 
century American astronomer Simon Newcomb amused himself by playing a 
game of solitaire related to this question. He would deal a deck of cards into a 
pile, so long as the face values were in nondecreasing order; but whenever the 
next card to be dealt had a face value lower than its predecessor, he would start 
a new pile. He wanted to know the probability that a given number of piles 
would be formed after the entire deck had been dealt out in this manner. 

Simon Newcomb’s problem therefore consists of finding the probability dis- 
tribution of runs in a random permutation of a multiset. The general answer 
is rather complicated (see exercise 12), although we have already seen how to 
solve the special case when all cards have a distinct face value. We will content 
ourselves here with a derivation of the average number of piles that appear in 
the game. 

Suppose first that there are m different types of cards, each occurring exactly 
p times. An ordinary bridge deck, for example, has m = 13 and p = 4 if suits 
are disregarded. A remarkable symmetry applying to this case was discovered 


5.1.3 


RUNS 


43 


e x 1 sin y — y 
e x ~ 1 cos y = x 


Fig. 3. Roots of e 2 1 = 2 . 



by P. A. MacMahon [Combinatory Analysis 1 (Cambridge, 1915), 212-213]: 
The number of permutations with k + 1 runs is the same as the number with 
mp — p — k + 1 runs. When p = 1, this relation is Eq. ( 7 ), but for p > 1 it is 
quite surprising. 

We can prove the symmetry by setting up a one-to-one correspondence 
between the permutations in such a way that each permutation with k + 1 runs 
corresponds to another having mp — p — k + 1 runs. The reader is urged to try 
discovering such a correspondence before reading further. 

No very simple correspondence is evident; MacMahon’s proof was based 
on generating functions instead of a combinatorial construction. But Foata’s 
correspondence (Theorem 5.1.2B) provides a useful simplification, because it 
tells us that there is a one-to-one correspondence between multiset permutations 
with k + 1 runs and permutations whose two-line notation contains exactly k 
columns Ij. with x < y. 

Suppose the given multiset is {p • 1, p • 2, . . . , p ■ m}, and consider the 
permutation whose two-line notation is 


/ 1 ... 1 2 ... 2 ... m ... to \ 

V *11 • • • X\ p *21 . . . X2p - . * *mi . . . x m p J 

We can associate this permutation with another one, 

/I ... 1 2 ... 2 ... to ... to\ 

\ x 'll ••• x lp x 'ml x 'mp ••• *21 ••• x 2p ) 


( 32 ) 


(33) 


where x' = m + 1 — x. If ( 32 ) contains k columns of the form v x with x < y, then 
(33) contains (m — l)p—k such columns; for we need only consider the case y > 1 , 
and x < y is equivalent to a;' > m+2 — y. Now ( 32 ) corresponds to a permutation 


44 


SORTING 


5.1.3 


with k + 1 runs, and (33) corresponds to a permutation with mp-p-k + 1 runs, 
and the transformation that takes (32) into (33) is reversible — it takes (33) back 
into (32). Therefore MacMahon’s symmetry condition has been established. See 
exercise 14 for an example of this construction. 

Because of the symmetry property, the average number of runs in a random 
permutation must be |((fc + 1) + (mp - p - k + 1)) = 1 + 1 p(m - 1). For 
example, the average number of piles resulting from Simon Newcomb’s solitaire 
game using a standard deck will be 25 (so it doesn’t appear to be a very exciting 
way to play solitaire). 

We can actually determine the average number of runs in general, using a 
fairly simple argument, given any multiset {n x • x x , n 2 ■ x 2 , . . . , n m ■ x m } where 
the .r’s are distinct. Let n = ri\ + n 2 + • • • + n m , and imagine that all of the 
permutations a\ a 2 . . . a n of this multiset have been written down; we will count 
how often a, is greater than a i+ 1, for each fixed value of i, 1 < i < n. The 
number of times a* > a l+l is just half of the number of times a t / a i+1 ; and it 
is not difficult to see that a, = a i+1 = Xj exactly Nnj(nj - 1 )/n(n - 1) times, 
where N is the total number of permutations. Hence a, = a l+1 exactly 

n(n - 1) ( ni ( ni - !) + ' ' • + n m (n m - 1)) = — — + • • • + n 2 m - n) 

times, and a; > aj+i exactly 


N 


2 n(n — 1) 


{n 2 - (nj + ■ ■ ■ + n 2 J) 


times. Summing over i and adding N, since a run ends at a n in each permutation, 
we obtain the total number of runs among all N permutations: 


+ + + (34) 

Dividing by N gives the desired average number of runs. 

Since runs are important in the study of “order statistics,” there is a fairly 
large literature dealing with them, including several other types of runs not 
considered here. For additional information, see the book Combinatorial Chance 
by F. N. David and D. E. Barton (London: Griffin, 1962), Chapter 10; and the 
survey paper by D. E. Barton and C. L. Mallows, Annals of Math. Statistics 36 
(1965), 236-260. 


EXERCISES 

1. [M26] Derive Euler’s formula (13). 

► 2. [M22] (a) Extend the idea used in the text to prove (8), considering those se- 
quences (i 1 < 1 2 . . . o n that contain exactly q distinct elements, in order to prove the 
formula 


integer q > 0. 


5.1.3 


RUNS 


45 


(b) Use this identity to prove that 




m)!, 


for n > m. 


3. [HM25] Evaluate the sum ^ fc (£)(—l)*\ 

4. [M21] What is the value of £*,(-!)*{£} fc! ("m*) ? 

5. [M20] Deduce the value of (£) modp when p is prime. 

► 6 . [M21] Mr. B. C. Dull noticed that, by Eqs. ( 4 ) and ( 13 ), 

k> o x/c/ k>0j>0 yle 

Carrying out the sum on k first, he found that Sfc>o( _1 ) fc_J (fc-j) = 0 f° r 3 > 0 ; 
hence n\ = 0 for all n > 0. Did he make a mistake? 

7. [HM 4 O] Is the probability distribution of runs, given by ( 14 ), asymptotically 
normal? (See exercise 1.2.10-13.) 

8 . [M24] (P. A. MacMahon.) Show that the probability that the first run of a 
sufficiently long permutation has length h, the second has length l 2 , . . . , and the fcth 
has length > Ik, is 

l/(/i + I 2 )! l/(!i + h + fa)* ••• l/(2i + fa + fa + • • ■ + Ifc)! \ 

1 I//2! 1/(^2 + I3)! • • • 1/(^2 + I3 + • • • + 

det 0 1 l/»s! ••• l/(l 3 + ■■• + !*)! 

0 0 ... 1 1 /ljfe! 

9. [M30] Let hk(z) — 5 2PkmZ m , where pkm is the probability that m is the total 

length of the first k runs in a random (infinite) sequence. Find “simple” expressions 
for h\(z), (z ) , and the super generating function h(z,x) = ^2 k h k (z)x k . 

10. [HM30] Find the asymptotic behavior of the mean and variance of the distribu- 
tions hk(z) in the preceding exercise, for large k. 

11. [M40] Let Hk(z) ~ J2 F\mZ m , where P k m is the probability that m is the length 
of the fcth run in a random (infinite) sequence. Express Hi(z), H 2 (z), and the super 
generating function H(z,x) = H k (z)x k in terms of familiar functions. 

12. [M33] (P. A. MacMahon.) Generalize Eq. ( 13 ) to permutations of a multiset, by 
proving that the number of permutations of {ni ■ 1 , n 2 • 2 , . . . , n m ■ m} having exactly 
k runs is 

where n = ni + n 2 + 1 - n TO ,. 

13. [05] If Simon Newcomb’s solitaire game is played with a standard bridge deck, 
ignoring face value but treating clubs < diamonds < hearts < spades, what is the 
average number of piles? 

14. [Ml 8 ] The permutation 3111231423342244 has 5 runs; find the correspond- 
ing permutation with 9 runs, according to the text’s construction for MacMahon’s 
symmetry condition. 


46 


SORTING 


5.1.3 


► 15. [M21] ( Alternating runs.) The classical nineteenth-century literature of combi- 
natorial analysis did not treat the topic of runs in permutations, as we have considered 
them, but several authors studied “runs” that are alternately ascending and descending. 
Thus 53247618 was considered to have 4 runs: 5 3 2, 2 4 7, 761, and 18. (The first 
run would be ascending or descending, according as oi < a 2 or a! > a 2 ; thus a x a 2 . . . a„ 
and a n . . . a 2 a x and (n + 1 — a x )(n + 1 — a 2 ) . . . (n + 1 — q. n ) all have the same number 
of alternating runs.) When n elements are being permuted, the maximum number of 
runs of this kind is n — 1. 

Find the average number of alternating runs in a random permutation of the set 
{1, 2, . . . , n}. [Hint: Consider the proof of (34).] 

16. [M30] Continuing the previous exercise, let \ n k \ be the number of permutations 
of {1,2, ...,n} that have exactly k alternating runs. Find a recurrence relation, by 
means of which a table of ]^)( can be computed; and find the corresponding recurrence 
relation for the generating function G n (z) = Y)kll} zk / n '- Use the latter recurrence 
to discover a simple formula for the variance of the number of alternating runs in a 
random permutation of {1, 2 , ... , n}. 

17. [M25] Among all 2 n sequences a x a 2 . . . a„, where each a 3 is either 0 or 1, how 
many have exactly k runs (that is, k - 1 occurrences of a 3 > a 3+ i)? 

18. [ M28 ] Among all n! sequences 61 62 • • • b n such that each bj is an integer in the 
range 0 < bj < n - j, how many have (a) exactly k descents (that is, k occurrences of 
bj > bj+ 1)? (b) exactly k distinct elements? 




Fig. 4. Nonattacking rooks on a chessboard, with k = 3 rooks below the main diagonal. 

► 19. [M26] (I. Kaplansky and J. Riordan, 1946.) (a) In how many ways can n non- 
attacking rooks no two in the same row or column — be placed on an nxn chessboard, 
so that exactly k lie below the main diagonal? (b) In how many ways can k nonattacking 
rooks be placed below the main diagonal of an n x n chessboard? 

For example, Fig. 4 shows one of the 15619 ways to put eight nonattacking rooks 
on a standard chessboard with exactly three rooks in the unshaded portion below the 
main diagonal, together with one of the 1050 ways to put three nonattacking rooks on 
a triangular board. 

► 20. [ M21 ] A permutation is said to require k readings if we must scan it k times from 
left to right in order to read off its elements in nondecreasing order. For example, the 



5.1.4 


TABLEAUX AND INVOLUTIONS 


47 


permutation 491825367 requires four readings: On the first we obtain 1,2,3; on the 
second we get 4, 5, 6, 7; then 8; then 9. Find a connection between runs and readings. 

21 . [M22] If the permutation ai02 -..a n of {1,2, ...,n} has k runs and requires 
j readings, in the sense of exercise 20, what can be said about n„ . . . a 2 ai? 

22. [M26] (L. Carlitz, D. P. Roselle, and R. A. Scoville.) Show that there is no 
permutation of {1, 2 , . . . , n} with n + 1 — r runs, and requiring s readings, if rs < n; 
but such permutations do exist ifn>n + l— r>s>l and rs > n. 

23 . [ HM42 ] (Walter Weissblum.) The “long runs” of a permutation ai 0,2 . ■ . a n are 
obtained by placing vertical lines just before a segment fails to be monotonic; long 
runs are either increasing or decreasing, depending on the order of their first two 
elements, so the length of each long run (except possibly the last) is > 2. For example, 
75|62|389|14 has four long runs. Find the average length of the first two long 
runs of an infinite permutation, and prove that the limiting long-run length is 

(1 + cot i)/(3 - cot 1) « 2.4202. 

24 . [M30] What is the average number of runs in sequences generated as in exercise 
5.1.1-18, as a function of pi 

25. [ M25 ] Let Ui, ...,{/„ be independent uniform random numbers in [0 . . 1). What 
is the probability that \U\ + • • ■ + U„ J = kl 

26 . [ M20 ] Let •& be the operation z~, which multiplies the coefficient of z n in a 
generating function by n. Show that the result of applying d to 1/(1 — z) repeatedly, 
m times, can be expressed in terms of Eulerian numbers. 

► 27 . [M21] An increasing forest is an oriented forest in which the nodes are labeled 
{1, 2 , . . . , n} in such a way that parents have smaller numbers than their children. Show 
that ( ” ) is the number of n-node increasing forests with k + 1 leaves. 

28 . [ HM35 ] Find the asymptotic value of the numbers z m in Fig. 3 as m -> 00, and 
prove that J2™=i( z m + ^m 1 ) = e - 5/2. 

► 29. [ M30 ] The permutation a\ . . ,a n has a “peak” at a 3 if 1 < j < n and a 3 _ 1 < aj > 
Oj + i. Let Snk be the number of permutations with exactly k peaks, and let t nk be the 
number with k peaks and k descents. Prove that (a) s n k = 5) ( 2 n J + j ( 2fc + J + ^'{ 2k+2 j 
(see exercise 16); (b) s nk = 2 n ~ 1 ~ 2k t nk -, (c) '£ k (”>x fc = t nk x k (l + x) n-1-2fc . 

*5.1.4. Tableaux and Involutions 

To complete our survey of the combinatorial properties of permutations, we 
will discuss some remarkable relations that connect permutations with arrays 
of integers called tableaux. A Young tableau of shape (rti, n 2 , ■ . . , n m ), where 
n i > n -2 > ■ ■ ■ > n m > 0, is an arrangement of rii 4- n 2 + ■ • • + n m distinct 
integers in an array of left-justified rows, with n, elements in row i, such that 
the entries of each row are in increasing order from left to right, and the entries 
of each column are increasing from top to bottom. For example, 


1 

2 

5 

9 

10 

15 

3 

6 

7 

13 


4 

8 

12 

14 

11 




(1) 


48 


SORTING 


5.1.4 


is a Young tableau of shape (6, 4, 4, 1). Such arrangements were introduced by 
Alfred Young as an aid to the study of matrix representations of permutations 
[see Proc. London Math. Soc. (2) 28 (1928), 255-292; Bruce E. Sagan, The 
Symmetric Group (Pacific Grove, Calif.: Wadsworth & Brooks/Cole, 1991)]. For 
simplicity, we will simply say “tableau” instead of “Young tableau.” 

An involution is a permutation that is its own inverse. For example, there 
are ten involutions of {1, 2, 3, 4}, 


(l 2 3 4\ A 2 3 4\ 

V1234J V2134J 


A 2 3 4\ 
V3 2 1 4/ 


A 2 3 4\ 
\4 2 3 1 ) 


A 2 3 4\ 
VI 3 2 4j 


A 2 3 4\ A 2 3 4\ 
Vl 4 3 2y VI 2 4 3/ 


A 2 3 4\ A 2 3 4\ A 2 3 4\ 
V2 14 3/ V 3412 J V4 3 2 lJ 


( 2 ) 


The term “involution” originated in classical geometry problems; involutions in 
the general sense considered here were first studied by H. A. Rothe when he 
introduced the concept of inverses (see Section 5.1.1). 

It may appear strange that we should be discussing both tableaux and 
involutions at the same time, but there is an extraordinary connection be- 
tween these two apparently unrelated concepts: The number of involutions of 
{1,2, ... ,n} is the same as the number of tableaux that can be formed from the 
elements {1,2, . . . ,nj. For example, exactly ten tableaux can be formed from 
{1, 2, 3, 4}, namely, 


[I 


3 4 

1 

CO 

1 

4 




2_ 


2 

3_ 



1 

J] 

1 

2 3 

1 

3 


_3_ 


4 


2 

4 


4 


1 

« 

2 


4_ 


1 

2 

3 

4 


2 4 


1 

2 

3 

4 


(3) 


corresponding respectively to the ten involutions ( 2 ). 

This connection between involutions and tableaux is by no means obvious, 
and there is probably no very simple way to prove it. The proof we will discuss 
involves an interesting tableau-construction algorithm that has several other 
surprising properties. It is based on a special procedure that inserts new elements 
into a tableau. 


For example, suppose that we want to insert the element 8 into the tableau 


1 

3 

5 

9 

12 

16 

2 

6 

10 

15 


4 

13 

14 



11 





17 






(4) 




5.1.4 


TABLEAUX AND INVOLUTIONS 


49 


The method we will use starts by placing the 8 into row 1, in the spot previously 
occupied by 9, since 9 is the least element greater than 8 in that row. Element 9 is 
“bumped down” into row 2, where it displaces the 10. The 10 then “bumps” the 
13 from row 3 to row 4; and since row 4 contains no element greater than 13, the 
process terminates by inserting 13 at the right end of row 4. Thus, tableau ( 4 ) 
has been transformed into 


1 

3 

5 

8 

12 

16 

2 

6 

9 

15 


4 

10 

14 



11 

13 




17 






(5) 


A precise description of this process, together with a proof that it always 
preserves the tableau properties, appears in Algorithm I. 


Algorithm I ( Insertion into a tableau). Let P = (Pij) be a tableau of positive 
integers, and let r be a positive integer not in P. This algorithm transforms P 
into another tableau that contains x in addition to its original elements. The new 
tableau has the same shape as the old, except for the addition of a new position 
in row s, column t, where s and t are quantities determined by the algorithm. 

(Parenthesized remarks in this algorithm serve to prove its validity, since 
it is easy to verify inductively that the remarks are valid and that the array P 
remains a tableau throughout the process. For convenience we will assume that 
the tableau has been bordered by zeros at the top and left and with 00 ’s to the 
right and below, so that P tJ is defined for all i,j > 0. If we define the relation 

a < b if and only if a < b or a = b = 0 or a — b = 00 , ( 6 ) 

the tableau inequalities can be expressed in the convenient form 
Pij =0 if and only if i = 0 or j = 0; 

Pi] £ Pi(j+ 1 ) and P^ < P( i+ i)j, for all i,j > 0 . 

The statement “ x 0 P” means that either 1 = x or r / P tJ for all i,j > 0.) 

11. [Input x.] Set * •< — 1, set X\ <— x, and set j to the smallest value such that 
Pij = 00 . 

12. [Find aq+i.] (At this point P(i-i)j < aq < Pij and Xi £ P.) If Xi < P i ^_i' ) , 
decrease j by 1 and repeat this step. Otherwise set Xi + 1 «— Pij and set 
ri <- j- 

13. [Replace by Xj] (Now P i(J _ 1} < x { < x i+1 = P l3 < P i{j+ 1)( P (i -i)j < x { < 
Xi+i = Pij £ P(i+i)j, and ^ = j.) Set P tj <- aq. 

14. [Is x i+ i = 00 ?] (Now Pi(j-i) < Pij = Xi < x i+ i < Pi( j+ i), P(i-i)j < Pij = 

Xi < Xi + 1 < ri = j, and Xi + i £ P.) If Xj+i / 00 , increaise * by 1 and 

return to step 12 . 


50 


SORTING 


5.1.4 


15. [Determine s, t.] Set s <- i, t <- j, and terminate the algorithm. (At this 
point the conditions 

P st 7^ oo and P( s +\)t = P s (t+i) = oo (8) 

are satisfied.) | 

Algorithm I defines a “bumping sequence” 

a; = Xl < x 2 < ■ ■ ■ < x a < x 9+1 = oo, ( 9 ) 

as well as an auxiliary sequence of column indices 


ri > r 2 > ■ ■ ■ > r s = t; ( 10 ) 

element P ir . has been changed from x i+1 to x u for 1 < i < s. For example, 
when we inserted 8 into (4), the bumping sequence was 8, 9, 10, 13, 00, and the 
auxiliary sequence was 4, 3, 2, 2. We could have reformulated the algorithm so 
that it used much less temporary storage; only the current values of j, x u and 
x i+1 need to be remembered. But sequences (9) and (10) have been introduced 
so that we can prove interesting things about the algorithm. 

The key fact we will use about Algorithm I is that it can be run backwards: 
Given the values of s and t determined in step 15, we can transform P back 
into its original form again, determining and removing the element x that was 
inserted. For example, consider (5) and suppose we are told that element 13 is 
in the position that used to be blank. Then 13 must have been bumped down 
from row 3 by the 10, since 10 is the greatest element less than 13 in that row; 
similarly the 10 must have been bumped from row 2 by the 9, and the 9 must, 
have been bumped from row 1 by the 8. Thus we can go from (5) back to (4). 
The following algorithm specifies this process in detail: 


Algorithm D ( Deletion from a tableau). Given a tableau P and positive 
integers s, t satisfying (8), this algorithm transforms P into another tableau, 
having almost the same shape, but with 00 in column t of row s. An element x, 
determined by the algorithm, is deleted from P. 

(As in Algorithm I, parenthesized assertions are included here to facilitate 
a proof that P remains a tableau throughout the process.) 

Dl. [Input s, t] Set j <- t, i <- s, a: s+1 4- 00. 

D2. [Find Xi .] (At this point P tJ < x i+1 < P (i+1)j and x i+1 0 P.) If p i(j+1) < 
x i+ii increase j by 1 and repeat this step. Otherwise set x l 4— P- and 
x i <- j. 


U3. [Replace by x l+1 .\ (Now P i(j _ x) < P 2J = Xi < x i+1 ; 

Pij = Xi < x i+1 < P( i+ 1 )j, and n = j.) Set 4 - x i+1 
D4. [Is i = 1?] (Now P i(j _ 1} < Xi < x t+1 = P i:j < P i(j _ 
x i+i ~ Pij ~ P(i+i)j, and , r l = j.) If i > 1, decrease i 


0 + 1 ) > P(i — l)j < - 

P(i~l)j ^ x i 

1 and return to 


D5. [Determine x.\ Set x x^\ the algorithm terminates. (Now 0 < x < 00.) | 


5.1.4 


TABLEAUX AND INVOLUTIONS 


51 


The parenthesized assertions appearing in Algorithms I and D are not only a 
useful way to prove that the algorithms preserve the tableau structure; they also 
serve to verify that Algorithms I and D are perfect inverses of each other. If we 
perform Algorithm I first, given some tableau P and some positive integer x 0 P, 
it will insert x and determine positive integers s, t satisfying (8); Algorithm D 
applied to the result will recompute x and will restore P . Conversely, if we 
perform Algorithm D first, given some tableau P and some positive integers 
s, t satisfying (8), it will modify P, deleting some positive integer x\ Algorit hm I 
applied to the result will recompute s, t and will restore P. The reason is that the 
parenthesized assertions of steps 13 and D4 are identical, as are the assertions of 
steps 14 and D3, and these assertions characterize the value of j uniquely. Hence 
the auxiliary sequences (9), (10) are the same in each case. 

Now we are ready to prove a basic property of tableaux: 


Theorem A. There is a one-to-one correspondence between the set of all 
permutations of {1,2,..., rz} and the set of ordered pairs ( P,Q ) of tableaux 
formed from {1,2,..., n}, where P and Q have the same shape. 


(An example of this theorem appears within the proof that follows.) 

Proof. It is convenient to prove a slightly more general result. Given any two-line 
array 

f 9i <?2 q n \ qi < Qi < ••• < q n , . . 

\Pi P2 Pn ) ’ Pi,P2, ••• distinct, 

we will construct two corresponding tableaux P and Q, where the elements of P 
are {pi, . . . ,p n } and the elements of Q are {<71 , . . . , q n } and the shape of P is the 
shape of Q. 

Let P and Q be empty initially. Then, for i = 1, 2, . . . , n (in this order), 
do the following operation: Insert p* into tableau P using Algorithm I; then set 
Qst t— (p, where s and t specify the newly filled position of P. 

For example, if the given permutation is (J ^ ® ® *), we obtain 


Insert 7: 
Insert 2: 

Insert 9: 

Insert 5: 


P Q 

0 0 

r 

3 


2 

7 


2 

9 

7_ 


~2 

5 

]_ 

9 


1 

5 

A 



1 

5 

3 

6 


2 

3 


0 

5 

5 

9 


CO 

6 

7 


00 



(12) 


Insert 3: 


52 


SORTING 


5.1.4 


so the tableaux ( P,Q ) corresponding to (72953) are 



(! 3 ) 


It is clear from this construction that P and Q always have the same shape; 
furthermore, since we always add elements on the periphery of Q, in increasing 
order, Q is a tableau. 

Conversely, given two equal-shape tableaux P and Q, we can find the cor- 
responding two-line array (11) as follows. Let the elements of Q be 


qi < q 2 < ■ ■ ■ < q n - 

For i = n, . . . , 2, 1 (in this order), let p t be the element x that is removed when 
Algorithm D is applied to P, using the values s and t such that Q st = q l . 

For example, this construction will start with ( 13 ) and will successively undo 
the calculation ( 12 ) until P is empty, and (72953) ' s obtained. 

Since Algorithms I and D are inverses of each other, the two constructions 
we have described are inverses of each other, and the one-to-one correspondence 
has been established. | 

The correspondence defined in the proof of Theorem A has many start ling 
properties, and we will now proceed to derive some of them. The reader is urged 
to work out the example in exercise 1 , in order to become familiar with the 
construction, before proceeding further. 

Once an element has been bumped from row 1 to row 2 , it doesn’t affect 
row 1 any longer; furthermore rows 2, 3, . . . are built up from the sequence of 
bumped elements in exactly the same way as rows 1 , 2,... are built up from the 
original permutation. These facts suggest that we can look at the construction 
of Theorem A in another way, concentrating only on the first rows of P and Q. 
For example, the permutation ( 72953 ) causes the following action in row 1, 
according to ( 12 ): 

1: Insert 7, set Qn «— 1. 

3: Insert 2, bump 7. 

5 : Insert 9 , set Q12 5 . (14) 

6 : Insert 5, bump 9. 

8 : Insert 3, bump 5. 

Thus the first row of P is 2 3, and the first row of Q is 1 5. Furthermore, the 
remaining rows of P and Q are the tableaux corresponding to the “bumped” 
two-line array 

(?!!)■ <-> 

In order to study the behavior of the construction on row 1, we can consider 
the elements that go into a given column of this row. Let us say that (qi,Pi) is 


5.1.4 


TABLEAUX AND INVOLUTIONS 


53 


in class t with respect to the two-line array 


fli 92 ••• 9 n\ 9 i < 92 < < 9 n, , , 

\Pi P 2 ■■■ Pn ) ’ pi,p 2 , distinct, ^ ' 

if pi = P u after Algorithm I has been applied successively to Pi,P 2 , ■ ■ ■ ,Pi, 

starting with an empty tableau P. (Remember that Algorithm I always inserts 

the given element into row 1.) 

It is easy to see that ( qi,Pi ) is in class 1 if and only if p, has i — 1 inversions, 
that is, if and only if p* = min{pi,p2, . . . ,Pi} is a “left- to- right minimum.” If we 
cross out the columns of class 1 in (16), we obtain another two-line array 


( <?2 • • • q'm \ 

VP'l P2 ••• Pm) 


( 17 ) 


such that ( q,p ) is in class t with respect to (17) if and only if it is in class t+1 with 
respect to (16). The operation of going from (16) to (17) represents removing 
the leftmost position of row 1. This gives us a systematic way to determine the 
classes. For example in (72953) the elements that are left-to-right minima are 
7 and 2, so class 1 is {(1,7), (3,2)}; in the remaining array (jj ® ®) all elements 
are minima, so class 2 is {(5, 9), (6, 5), (8, 3)}. In the “bumped” array (15), class 
1 is {(3,7), (8,5)} and class 2 is {(6,9)}. 

For any fixed value of t, the elements of class t can be labeled 


(?«i ), • • • , ( qi k ,Pi k ) 


in such a way that 

9fi < 9 i 2 < ' ' ‘ < 9* fc , , . 

Ph > Pi 2 > •■■ > Pi k , [ } 

since the tableau position P u takes on the decreasing sequence of values p M , . . . , 
p ik as the insertion algorithm proceeds. At the end of the construction we have 


P*lt Pi k , Qu — Qit , 


(!9) 


and the “bumped” two-line array that defines rows 2, 3, . . . of P and Q contains 
the columns 


f 9*2 9*3 • • ■ 9u A 

\Ph Pi 2 ••• Pi k -x) 


(20) 


plus other columns formed in a similar way from the other classes. 

These observations lead to a simple method for calculating P and Q by 
hand (see exercise 3), and they also provide us with the means to prove a rather 
unexpected result: 

Theorem B. If the permutation 


1 2 ... n 
Cl 1 • • • CLfi 

corresponds to tableaux ( P,Q ) in the construction of Theorem A, then the 
inverse permutation corresponds to ( Q,P ). 


54 


SORTING 


5.1.4 


This fact is quite startling, since P and Q are formed by such completely 
different methods in Theorem A, and since the inverse of a permutation is 
obtained by juggling the columns of the two-line array rather capriciously. 

Proof. Suppose that we have a two-line array (16); its columns are essentially 
independent and can be rearranged. Interchanging the lines and sorting the 
columns so that the new top line is in increasing order gives the “inverse” array 

<12 ■■■ q n \ _ fpi P2 ■■■ Pn \ 

\Pi P2 ... Pn) \qi q 2 ... q n ) 

= (Pi P 2 ■■■ Pi < P 2 <■■■< Pni , , 

\ 9 i I2 ■■■ q'n ) ’ q[, q ' 2 , ..., q' n distinct. 

We will show that this operation corresponds to interchanging P and Q in the 
construction of Theorem A. 

Exercise 2 reformulates our remarks about class determination so that the 
class of ( qiiPi ) doesn’t depend on the fact that qi,q 2 , ■ . ■ ,q n are in ascending 
order. Since the resulting condition is symmetrical in the q’s and the p’s, the 
operation (21) does not destroy the class structure; if ( q,p ) is in class t with 
respect to (16), then (p, q) is in class t with respect to (21). If we therefore 
arrange the elements of the latter class t as 

Pi k < ■■■ <Pi 2 <Ph, 

<hk > ■■■ > qi 2 > Qh . 

by analogy with (18), we have 

Pit = , Qit = Pi k (23) 

as in (19), and the columns 


f Pik - 1 • • • Pi 2 Pil A 
\ qik ■ ■ • qi 3 <li2 ) 


( 2 4) 


go into the “bumped” array as in (20). Hence the first rows of P and Q are 
interchanged. Furthermore the “bumped” two-line array for (21) is the inverse 
of the bumped two-line array for (16), so the proof is completed by induction 
on the number of rows in the tableaux. | 


Corollary B. The number of tableaux that can be formed from {1, 2, . . . , n} is 
the number of involutions on {1,2 ,..., n}. 

Proof. If 7 r is an involution corresponding to ( P,Q ), then 7r = n~ corresponds 
to ( Q,P ); hence P = Q. Conversely, if 7r is any permutation corresponding 
to ( P,P ), then 7 r“ also corresponds to (P,P); hence 7r = tt~. So there is a 
one-to-one correspondence between involutions ir and tableaux P. | 

It is clear that the upper-left corner element of a tableau is always the 
smallest. This suggests a possible way to sort a set of numbers: First we can 
put the numbers into a tableau, by using Algorithm I repeatedly; this brings the 
smallest element to the corner. Then we delete the smallest element, rearranging 


5.1.4 


TABLEAUX AND INVOLUTIONS 


55 


the remaining elements so that they form another tableau; then we delete the 
new smallest element; and so on. 

Let us therefore consider what happens when we delete the corner element 
from the tableau 


1 

3 

5 

7 

11 

15 

2 

6 

8 

14 


4 

9 

13 



10 

12 




16 






( 2 5) 


If the 1 is removed, the 2 must come to take its place. Then we can move the 
4 up to where the 2 was, but we can’t move the 10 to the position of the 4; the 
9 can be moved instead, then the 12 in place of the 9. In general, we are led to 
the following procedure. 


Algorithm S ( Delete comer element). Given a tableau P, this algorithm deletes 
the upper left corner element of P and moves other elements so that the tableau 
properties are preserved. The notational conventions of Algorithms I and D are 
used. 

51. [Initialize.] Set r «— 1, s <— 1. 

52. [Done?] If P rs — 00 , the process is complete. 

53. [Compare.] If P( r +i)s & P r (s+ 1 ), g° to step S5. (We examine the elements 
just below and to the right of the vacant cell, and we will move the smaller 
of the two.) 

54. [Shift left.] Set P rs <— P r (s+i), s <— s + 1, and return to S3. 

55. [Shift up.] Set P ra «— P( r+ i) s , r <— r + 1, and return to S2. | 

It is easy to prove that P is still a tableau after Algorithm S has deleted its 
corner element (see exercise 10). So if we repeat Algorithm S until P is empty, 
we can read out its elements in increasing order. Unfortunately this doesn’t 
turn out to be as efficient a sorting algorithm as other methods we will see; its 
minimum running time is proportional to n 15 , but similar algorithms that use 
trees instead of tableau structures have an execution time on the order of n log n. 

In spite of the fact that Algorithm S doesn’t lead to a superbly efficient 
sorting algorithm, it has some very interesting properties. 

Theorem C (M. P. Schutzenberger). If P is the tableau formed by the con- 
struction of Theorem A from the permutation Oi a 2 . . . a n , and if 

di = min{ai,a 2 , . . . ,a„}, 

then Algorithm S changes P to the tableau corresponding to a 1 . . . a.,_ia I+ i. . . a„. 
Proof. See exercise 13. | 


56 


SORTING 


5.1.4 


After we apply Algorithm S to a tableau, let us put the deleted element into 
the newly vacated place P rs , but in italic type to indicate that it isn’t really part 
of the tableau. For example, after applying this procedure to the tableau (25) 
we would have 


2 

3 

5 

7 

11 

15 

4 

6 

8 

14 


9 

12 

13 



10 

1 




16 






and two more applications yield 


4 

5 

7 

11 

15 

* 

6 

8 

13 

14 


9 

12 

3 



10 

1 




16 






Continuing until all elements are removed gives 


16 

14 

13 

12 

10 

2 

15 

9 

6 

4 


11 

5 

3 



8 

1 




7 






(26) 


which has the same shape as the original tableau (25). This configuration may 
be called a dual tableau, since it is like a tableau except that the “dual order” 
has been used (reversing the roles of < and >). Let us denote the dual tableau 
formed from P in this way by the symbol P s . 

From P s we can determine P uniquely; in fact, we can obtain the original 
tableau P from P s , by applying exactly the same algorithm — but reversing the 
order and the roles of italic and regular type, since P s is a dual tableau. For 
example, two steps of the algorithm applied to (26) give 


14 

13 

12 

10 

2 

15 

11 

9 

6 

4 


8 

5 

3 



7 

1 




16 






and eventually (25) will be reproduced again! This remarkable fact is one of the 
consequences of our next theorem. 


5.1.4 


TABLEAUX AND INVOLUTIONS 


57 


Theorem D (C. Schensted, M. P. Schiitzenberger). Let 

( 9i 92 • • • 9n \ 

\Pl P2 ••• PnJ 


(27) 


be the two-line array corresponding to the tableaux ( P,Q ). 

a) Using dual (reverse) order on the q’s, but not on the p’s, the two-line array 


fq n ■■■ 92 qi\ 

\Pn ••• P 2 Pi) 


(28) 


corresponds to (P T , (Q S ) T ). 

As usual, “T” denotes the operation of transposing rows and columns; P T is a 
tableau, while ( Q S ) T is a dual tableau, since the order of the q’s is reversed. 

b) Using dual order on the p’s, but not on the q’s, the two-line array (27) 
corresponds to ((P S ) T ,Q T ) ■ 

c) Using dual order on both the p’s and the q’s, the two-line array (28) corre- 
sponds to ( P S ,Q S ). 


Proof. No simple proof of this theorem is known. The fact that case (a) 
corresponds to ( P T ,X ) for some dual tableau X is proved in exercise 5; hence 
by Theorem B, case (b) corresponds to ( Y,Q T ) for some dual tableau Y, and 

Y must have the shape of P r . 

Let pi — min{pi, . . . ,p n }; since p, is the “largest” element in the dual order, 
it appears on the periphery of Y, and it doesn’t bump any elements in the con- 
struction of Theorem A. Thus, if we successively insert pi, . . . ,p«-i,p*+i, • . . ,p n 
using the dual order, we get Y — {pi}, that is, Y with p, removed. By Theorem C 
if we successively insert pi, . . . ,Pi-i,Pi+u . . . ,p n using the normal order, we get 
the tableau d(P) obtained by applying Algorithm S to P. By induction on n, 

Y — {Pi} — (d(P) s ) T - But since 

(P S ) T - { Pi } = (d(P) s ) T , (29) 

by definition of the operation S, and since Y has the same shape as (P S ) T , we 
must have Y = (P S ) T . 

This proves part (b), and part (a) follows by an application of Theorem B. 
Applying parts (a) and (b) successively then shows that case (c) corresponds 
to ({(P T ) S ) T ,{(Q S ) T ) T )', and this is ( P S ,Q S ) since ( P S ) T = ( P T ) S by the 
row-column symmetry of operation S. | 


In particular, this theorem establishes two surprising facts about the tableau 
insertion algorithm: If successive insertion of distinct elements pi , . . . , p n into an 
empty tableau yields tableau P, insertion in the opposite order p n , ■ ■ ■ ,p\ yields 
the transposed tableau P T . And if we not only insert the p’s in this order 
p n , ... ,Pi but also interchange the roles of < and >, as well as 0 and 00, in 
the insertion process, we obtain the dual tableau P s . The reader is urged to 
try out these processes on some simple examples. The unusual nature of these 
coincidences might lead us to suspect that some sort of witchcraft is operating 


58 


SORTING 


5.1.4 


behind the scenes! No simple explanation for these phenomena is yet known; 
there seems to be no obvious way to prove even that case (c) corresponds to 
tableaux having the same shape as P and Q, although the characterization of 
classes in exercise 2 does provide a significant clue. 

The correspondence of Theorem A was given by G. de B. Robinson [Amer- 
ican J. Math. 60 (1938), 745-760, §5], in a somewhat' vague and different form, 
as part of his solution to a rather difficult problem in group theory. Robinson 
stated Theorem B without proof. Many years later, C. Schensted independently 
rediscovered the correspondence, which he described in terms of “bumping” as 
we have done in Algorithm I; Schensted also proved the “P” part of Theorem 
D(a) [see Canadian J. Math. 13 (1961), 179-191], M. P. Schiitzenberger [Math. 
Scand. 12 (1963), 117-128] proved Theorem C and the “Q" part of Theorem 
D(a), from which (b) and (c) follow. It is possible to extend the correspondence 
to permutations of multisets ; the case that pi , . . . , p n need not be distinct was 
considered by Schensted, and the “ultimate” generalization to the case that both 
the p’s and the q’s may contain repeated elements was investigated by Knuth 
[Pacific J. Math. 34 (1970), 709-727], 

Let us now turn to a related question: How many tableaux formed from 

{1,2, . . . ,n} have a given shape (ni,n 2 , . . .,n m ), where n 1 +n 2 -\ hn m = n? 

If we denote this number by f(n 1 ,n 2 , and if we allow the parameters nj 

to be arbitrary integers, the function / must satisfy the relations 

/(ni,n 2 , . . . ,n m ) — 0 unless n x > n 2 > • • • > n m > 0; 

/(ni,n 2 , . . . ,n m ,0) = f{n x ,n 2 , . . . ,n m ); 

/(ni,n 2 , . . . ,n m ) = f{n\ — l,n 2 , . . . ,n m ) + /(ni,n 2 — 1, . . . ,n m ) 

A f {n i , n 2 , . . . , n m 1) , 

if ni > n 2 > > n m > 1. (32) 

Recurrence (32) comes from the fact that a tableau with its largest element 
removed is always another tableau; for example, the number of tableaux of shape 
(6, 4, 4, 1) is /( 5, 4, 4, 1) + /( 6, 3, 4, 1) + /( 6, 4, 3, 1) + /(6, 4, 4, 0) = /( 5, 4, 4, 1) + 
/(6,4, 3, 1) + /(6,4, 4), since every tableau of shape (6, 4, 4, 1) on {1,2,..., 15} 
is formed by inserting the element 15 into the appropriate place in a tableau of 
shape (5,4,4, 1), (6,4, 3,1), or (6,4,4). Schematically, 



The function f{n u n 2 ,...,n m ) that satisfies these relations has a fairly 
simple form, 


(30) 

(31) 


f (n-i , n 2 , . . . , n m ) 


A(m + m - 1, n 2 + to - 2, . . . , n m ) n\ 
(ni + to - 1 )! (n 2 + to - 2 )! ... n m \ 


(34) 


5.1.4 


TABLEAUX AND INVOLUTIONS 


59 


provided that the relatively mild conditions 


ni+m-l>n 2 +m- 2 >--->n m 

are satisfied; here A denotes the “square root of the discriminant” function 


A(xi,x 2 , . . . , x m ) — det 


m-l m — 1 

X 1 x 2 


o-i 

Xi 

1 


x 2 

X 2 

1 


1 \ 
\ 


l / 


= (Xi-Xj). ( 35 ) 


Formula ( 34 ) was derived by G. Frobenius [Sitzungsberichte preuB. Akad. der 
Wissenschaften (1900), 516-534, §3], in connection with an equivalent problem 
in group theory, using a rather deep group-theoretical argument; a combinatorial 
proof was given independently by MacMahon [Philosophical Trans. A209 (1909), 
153-175]. The formula can be established by induction, since relations ( 30 ) and 
( 31 ) are readily proved and ( 32 ) follows by setting y - -1 in the identity of 
exercise 17. 

Theorem A gives a remarkable identity in connection with this formula for 
the number of tableaux. If we sum over all shapes, we have 


n! = f(ki,k 2 ...,k n ) 2 

ki >k2>‘-->k n >0 
k\+k 2 -\ ( -k n =n 


= n - 2 

ki>k 2 >->k n >0 

kl+k 2 -\ \-kn=7l 


A(fcx + n - 1, fc 2 + n - 2, . . . , k n ) 2 
(fci + n - l )! 2 (k 2 + n - 2)! 2 . . . k n \ 2 


= n ' 2 

gi>92>"->9n>0 

\-q n =(n+l)n/2 


%^ 2 v,fe) 2 , 

qi'. 2 q 2 \ 2 ...g „! 2 


hence 


E 

91+92 H l~9n=(^+ l)n/2 

9i,92, ...,9«>0 


A(gi,g 2 ,...,g n ) 2 
qi\ 2 q 2 \ 2 ...q n \ 2 


(36) 


The inequalities q\ > q 2 > • ■ ■ > q n have been removed in the latter sum, since 
the summand is a symmetric function of the r/’s that vanishes when q. t = qj. 
A similar identity appears in exercise 24. 

The formula for the number of tableaux can also be expressed in a much 
more interesting way, based on the idea of “hooks.” The hook corresponding to 
a cell in a tableau is defined to be the cell itself plus the cells lying below and 
to its right. For example, the shaded area in Fig. 5 is the hook corresponding to 
cell (2, 3) in row 2, column 3; it contains six cells. Each cell of Fig. 5 has been 
filled in with the length of its hook. 


60 


SORTING 


5.1.4 



Fig. 5. Hooks and hook lengths. 

If the shape of the tableau is (ni, n 2 , . . . , n m ), the longest hook has length 
ni+m-1. Further examination of the hook lengths shows that row 1 con- 
tains all the lengths ni+m — 1, ni+m — 2, . . . , 1 except for (ni+m — 1) — (n TO ), 
( n i +to— 1) — (n m _i + 1), . . . , (ni+m—1) — (n 2 + m — 2). In Fig. 5, for example, 
the hook lengths in row 1 are 12, 11, 10, ..., 1 except for 10, 9, 6, 3, 2; the 
exceptions correspond to five nonexistent hooks, from nonexistent cells (6,3), 
(5,3), (4,5), (3,7), (2,7) leading up to cell (1,7). Similarly, row j contains 
all lengths nj + m-j, ..., 1, except for (riy + m- j) - (n m ), ..., (■ nj + m-j )- 
(nj+i+m—j — l). It follows that the product of all the hook lengths is equal to 

(ni+m-1)! (n 2 + m-2 )! . . ,n m ! 

A(ni+m-l,n 2 -(-m-2 , . . . ,n m ) ' 

This is just what happens in Eq. (34), so we have derived the following celebrated 
result due to J. S. Frame, G. de B. Robinson, and R. M. Thrall [Canadian J. 
Math. 6 (1954), 316-318]: 

Theorem H. The number of tableaux on {1,2,..., nj having a specified shape 
is n! divided by the product of the hook lengths. | 

Since this is such a simple rule, it deserves a simple proof; a heuristic 
argument runs as follows: Each element of the tableau is the smallest in its 
hook. If we fill the tableau shape at random, the probability that cell (i,j) will 
contain the minimum element of the corresponding hook is the reciprocal of the 
hook length; multiplying these probabilities over all i and j gives Theorem H. 
But unfortunately this argument is fallacious, since the probabilities are far from 
independent! No direct proof of Theorem H, based on combinatorial properties of 
hooks used correctly, was known until 1992 (see exercise 39), although researchers 
did discover several instructive indirect proofs (exercises 35, 36, and 38). 

Theorem H has an interesting connection with the enumeration of trees, 
which we considered in Chapter 2. We observed that binary trees with n nodes 
correspond to permutations that can be obtained with a stack, and that such 
permutations correspond to sequences ai a 2 . . . a 2n of n S’s and n X’s, where the 
number of S’s is never less than the number of X’s as we read from left to right. 
(See exercises 2.2. 1-3 and 2.3. 1-6.) The latter sequences correspond in a natural 
way to tableaux of shape (n, n); we place in row 1 the indices i such that a, = S, 
and in row 2 we put those indices with a, = X. For example, the sequence 

sssxxssxxsxx 


5.1.4 


TABLEAUX AND INVOLUTIONS 


61 


corresponds to the tableau 


1 

2 

3 

6 

7 

10 

4 

5 

00 

9 

11 

12 


(37) 


The column constraint is satisfied in this tableau if and only if the number of X’s 
never exceeds the number of S’s from left to right. By Theorem H, the number 
of tableaux of shape (n, n) is 

(2 n)! , 

( n + 1)! n\ ’ 


so this is the number of binary trees, in agreement with Eq. 2.3.4.4-(i4). Further- 
more, this argument solves the more general “ballot problem” considered in 
the answer to exercise 2.2. 1-4, if we use tableaux of shape (n, to) for n > to. 
So Theorem H includes some rather complex enumeration problems as simple 
special cases. 

Any tableau A of shape (n, n) on the elements {1,2, ...,2n} corresponds 
to two tableaux (P, Q) of the same shape, in the following way suggested by 
MacMahon [Combinatory Analysis 1 (1915), 130-131]: Let P consist of the ele- 
ments {1, . . . , n} as they appear in A; then Q is formed by taking the remaining 
elements, rotating the configuration by 180°, and replacing n + 1, n + 2, . . . , 2n 
by n, n — 1, . . . , 1, respectively. For example, (37) splits into 


1 

2 

3 

6 

4 

5 



and 


rotation and renaming of the latter yields 
P = 


1 

2 

3 

6 

4 

5 



Q = 



7 

10 

8 

9 

11 

12 


1 

2 

4 

m 

3 

6 



(38) 


Conversely, any pair of equal-shape tableaux of at most two rows, each containing
n cells, corresponds in this way to a tableau of shape (n, n). Hence by exercise 7
the number of permutations a_1 a_2 ... a_n of {1, 2, ..., n} containing no decreasing
subsequence a_i > a_j > a_k for i < j < k is the number of binary trees with
n nodes. An interesting one-to-one correspondence between such permutations
and binary trees, more direct than the roundabout method via Algorithm I that
we have used here, has been found by D. Rotem [Inf. Proc. Letters 4 (1975),
58-61]; similarly there is a rather direct correspondence between binary trees
and permutations having no instances of a_i > a_k > a_j for i < j < k (see exercise
2.2.1-5).

The number of ways to fill a tableau of shape (6, 4, 4, 1) is obviously the
number of ways to put the labels {1, 2, ..., 15} onto the vertices of the directed
graph

    [diagram (39), a digraph on the 15 cells of the shape, not reproduced in this scan]





in such a way that the label of vertex u is less than the label of vertex v whenever
u → v. In other words, it is the number of ways to sort the partial ordering (39)
topologically, in the sense of Section 2.2.3.

In general, we can ask the same question for any directed graph that contains 
no oriented cycles. It would be nice if there were some simple formula generalizing 
Theorem H to the case of an arbitrary directed graph; but not all graphs have 
such pleasant properties as the graphs corresponding to tableaux. Some other 
classes of directed graphs for which the labeling problem has a simple solution 
are discussed in the exercises at the close of this section. Other exercises show 
that some directed graphs have no simple formula corresponding to Theorem H. 
For example, the number of ways to do the labeling is not always a divisor of n!.

To complete our investigations, let us count the total number of tableaux
that can be formed from n distinct elements; we will denote this number by t_n.
By Corollary B, t_n is the number of involutions of {1, 2, ..., n}. A permutation
is its own inverse if and only if its cycle form consists solely of one-cycles (fixed
points) and two-cycles (transpositions). Since t_{n-1} of the t_n involutions have
(n) as a one-cycle, and since t_{n-2} of them have (j n) as a two-cycle, for fixed
j < n, we obtain the formula


    t_n = t_{n-1} + (n-1) t_{n-2},                            (40)

which Rothe devised in 1800 to tabulate t_n for small n. The values for n ≥ 0
are 1, 1, 2, 4, 10, 26, 76, 232, 764, 2620, 9496, ....

Counting another way, let us suppose that there are k two-cycles and (n-2k)
one-cycles. There are (n choose 2k) ways to choose the fixed points, and the multinomial
coefficient (2k)!/(2!)^k is the number of ways to arrange the other elements
into k distinguishable transpositions; dividing by k! to make the transpositions
indistinguishable, we therefore obtain


    t_n = Σ_{k=0}^{⌊n/2⌋} t_n(k),    t_n(k) = n! / ((n-2k)! 2^k k!).    (41)
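As a quick numerical check (an illustrative Python sketch, not part of the original
text), the recurrence (40) and the explicit sum (41) can be compared directly:

    from math import factorial

    def t_rec(n):                        # recurrence (40)
        a, b = 1, 1                      # t_0, t_1
        for m in range(2, n + 1):
            a, b = b, b + (m - 1) * a
        return a if n == 0 else b

    def t_sum(n):                        # explicit sum (41)
        return sum(factorial(n) // (factorial(n - 2 * k) * 2 ** k * factorial(k))
                   for k in range(n // 2 + 1))

    print([t_rec(n) for n in range(11)])
    # [1, 1, 2, 4, 10, 26, 76, 232, 764, 2620, 9496] -- the values quoted above
    assert all(t_rec(n) == t_sum(n) for n in range(30))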


Unfortunately, this sum has no simple closed form (unless we choose to regard the
Hermite polynomial i^n 2^{-n/2} H_n(-i/√2) as simple), so we resort to two indirect
approaches in order to understand t_n better:

a) We can find the generating function 


    Σ_n t_n z^n/n! = e^{z + z^2/2};                           (42)

see exercise 25. 

b) We can determine the asymptotic behavior of t n . This is an instructive 
problem, because it involves some general techniques that will be useful to 
us in other connections, so we will conclude this section with an analysis of 
the asymptotic behavior of t n . 




The first step in analyzing the asymptotic behavior of (41) is to locate the 
main contribution to the sum. Since 

    t_n(k+1) / t_n(k) = (n-2k)(n-2k-1) / (2(k+1)),            (43)

we can see that the terms gradually increase from k = 0 until t_n(k+1) ≈ t_n(k)
when k is approximately (1/2)(n - √n); then they decrease to zero when k exceeds
(1/2)n. The main contribution clearly comes from the vicinity of k = (1/2)(n - √n).
It is usually preferable to have the main contribution at the value 0, so we write

    k = (1/2)(n - √n) + x,                                    (44)

and we will investigate the size of t_n(k) as a function of x.

One useful way to get rid of the factorials in t_n(k) is to use Stirling’s
approximation, Eq. 1.2.11.2-(18). For this purpose it is convenient (as we shall
see in a moment) to restrict x to the range

    -n^{ε+1/4} ≤ x ≤ n^{ε+1/4},                               (45)


where ε = 0.001, say, so that an error term can be included. A somewhat
laborious calculation, which the author did by hand in the 60s but which is now
easily done with the help of computer algebra, yields the formula

    t_n(k) = exp( (1/2)n ln n - (1/2)n + √n - (1/4) ln n - 2x^2/√n - 1/4 - (1/2) ln π
             - (4/3)x^3/n + 2x/√n + 1/(3√n) - (4/3)x^4/(n√n) + O(n^{5ε-3/4}) ).    (46)

The restriction on x in (45) can be justified by the fact that we may set x =
±n^{ε+1/4} to get an upper bound for all of the discarded terms, namely

    e^{-2n^{2ε}} exp( (1/2)n ln n - (1/2)n + √n - (1/4) ln n - 1/4 - (1/2) ln π + O(n^{3ε-1/4}) ),    (47)

and if we multiply this by n we get an upper bound for the sum of the excluded
terms. The upper bound is of lesser order than the terms we will compute for
x in the restricted range (45), because of the factor exp(-2n^{2ε}), which is much
smaller than any polynomial in n.

We can evidently remove the factor

    exp( (1/2)n ln n - (1/2)n + √n - (1/4) ln n - 1/4 - (1/2) ln π + 1/(3√n) )    (48)

from the sum, and this leaves us with the task of summing

    exp( -2x^2/√n - (4/3)x^3/n + 2x/√n - (4/3)x^4/(n√n) + O(n^{5ε-3/4}) )
      = exp(-2x^2/√n) (1 - (4/3)x^3/n + (8/9)x^6/n^2 + ⋯)
         × (1 + 2x/√n + 2x^2/n + ⋯) (1 - (4/3)x^4/(n√n) + ⋯)    (49)






over the range x = α, α+1, ..., β-2, β-1, where -α and β are approximately
equal to n^{ε+1/4} (and not necessarily integers). Euler’s summation formula,
Eq. 1.2.11.2-(10), can be written

    Σ_{α≤x<β} f(x) = ∫_α^β f(x) dx - (1/2) f(x) |_α^β + ⋯
                      + (B_{m+1}/(m+1)!) f^{(m)}(x) |_α^β + R_{m+1},    (50)

by translation of the summation interval. Here |R_{m+1}| ≤ (4/(2π)^{m+1}) ∫_α^β |f^{(m+1)}(x)| dx.
If we let f(x) = x^t exp(-2x^2/√n), where t is a fixed nonnegative integer, Euler’s
summation formula will give an asymptotic series for Σ f(x) as n → ∞, since

    f^{(m)}(x) = n^{(t-m)/4} g^{(m)}(n^{-1/4} x),    g(y) = y^t e^{-2y^2},    (51)

and g(y) is a well-behaved function independent of n. The derivative g^{(m)}(y) is
e^{-2y^2} times a polynomial in y; hence R_{m+1} = O(n^{(t-m)/4} ∫_{-∞}^{∞} |g^{(m+1)}(y)| dy) =
O(n^{(t-m)/4}). Furthermore if we replace α and β by -∞ and +∞ in the right-
hand side of (50), we make an error of at most O(exp(-2n^{2ε})) in each term.
Thus 

    Σ_{α≤x<β} f(x) = ∫_{-∞}^{∞} f(x) dx + O(n^{-m}),    for all m > 0;    (52)

only the integral is really significant, given this particular choice of f(x)! The
integral is not difficult to evaluate (see exercise 26), so we can multiply out and
sum formula (49), giving √(π/2) (n^{1/4} - (1/24) n^{-1/4} + O(n^{-1/2})). Thus

    t_n = (1/√2) n^{n/2} e^{-n/2 + √n - 1/4} (1 + (7/24) n^{-1/2} + O(n^{-3/4})).    (53)

Actually the O-terms here should have an extra 9ε in the exponent, but our
manipulations make it clear that this 9ε would disappear if we had carried further
accuracy in the intermediate calculations. In principle, the method we have
used could be extended to obtain O(n^{-k}) for any k, instead of O(n^{-3/4}). This
asymptotic series for t_n was first determined (using a different method) by Moser
and Wyman, Canadian J. Math. 7 (1955), 159-168.
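A reader with a computer handy can test (53) numerically; the following sketch
(illustrative Python, not from the book; the coefficient 7/24 is the one reconstructed
in the display above) compares the exact involution numbers with the two-term
asymptotic estimate:

    from math import exp, log, sqrt, factorial

    def t_exact(n):                          # the sum (41)
        return sum(factorial(n) // (factorial(n - 2 * k) * 2 ** k * factorial(k))
                   for k in range(n // 2 + 1))

    def t_asymptotic(n):                     # two-term form of (53)
        ln_main = 0.5 * n * log(n) - 0.5 * n + sqrt(n) - 0.25 - 0.5 * log(2.0)
        return exp(ln_main) * (1 + 7 / (24 * sqrt(n)))

    for n in (10, 50, 200):
        print(n, t_exact(n) / t_asymptotic(n))   # ratios tend to 1 as n grows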

The method we have used to derive ( 53 ) is an extremely useful technique for 
asymptotic analysis that was introduced by P. S. Laplace [Memoires Acad. Sci. 
(Paris, 1782), 1-88]; it is discussed under the name “trading tails” in CMath , 
§9.4. For further examples and extensions of tail-trading, see the conclusion of 
Section 5.2.2. 


EXERCISES 

1. [16] What tableaux (P, Q) correspond to the two-line array

    ( 1 2 3 4 5 6 7 8 9
      6 4 9 5 7 1 2 8 3 ),




in the construction of Theorem A? What two-line array corresponds to the tableaux 


    1 4 7          1 3 7
    2 8     and    4 5
    5 9            8 9

2. [M21] Prove that (q, p) belongs to class t with respect to (16) if and only if t is
the largest number of indices i_1, ..., i_t such that

    p_{i_1} < p_{i_2} < ⋯ < p_{i_t} = p,    q_{i_1} < q_{i_2} < ⋯ < q_{i_t} = q.

► 3. [M24] Show that the correspondence defined in the proof of Theorem A can also
be carried out by constructing a table such as this:

    Line 0:  1  3  5  6  8
    Line 1:  7  2  9  5  3
    Line 2:  ∞  7  ∞  9  5
    Line 3:     ∞     ∞  7
    Line 4:              ∞

Here lines 0 and 1 constitute the given two-line array. For k ≥ 1, line k+1 is formed
from line k by the following procedure:

a) Set p ← ∞.

b) Let column j be the leftmost column in which line k contains an integer < p, but
line k+1 is blank. If no such columns exist, and if p = ∞, line k+1 is complete;
if no such columns exist and p < ∞, return to (a).

c) Insert p into column j in line k+1, then set p equal to the entry in column j of
line k and return to (b).

Once the table has been constructed in this way, row k of P consists of those integers
in line k that are not in line (k+1); row k of Q consists of those integers in line 0 that
appear in a column containing ∞ in line k+1.

► 4. [M30] Let a_1 ... a_{j-1} a_j ... a_n be a permutation of distinct elements, and assume
that 1 < j ≤ n. The permutation a_1 ... a_{j-2} a_j a_{j-1} a_{j+1} ... a_n, obtained by inter-
changing a_{j-1} with a_j, is called “admissible” if either

i) j ≥ 3 and a_{j-2} lies between a_{j-1} and a_j; or

ii) j < n and a_{j+1} lies between a_{j-1} and a_j.

For example, exactly three admissible interchanges can be performed on the permuta-
tion 1546837: we can interchange the 1 and the 5 since 1 < 4 < 5; we can interchange
the 4 and the 6 since 4 < 5 < 6; we can interchange the 8 and the 3 since 3 < 6 < 8
(or since 3 < 7 < 8); but we cannot interchange the 5 and the 4, or the 3 and the 7.

a) Prove that an admissible interchange does not change the tableau P formed from
the permutation by successive insertion of the elements a_1, a_2, ..., a_n into an
initially empty tableau.

b) Conversely, prove that any two permutations that have the same P tableau can be
transformed into each other by a sequence of one or more admissible interchanges.
[Hint: Given that the shape of P is (n_1, n_2, ..., n_m), show that any permuta-
tion that corresponds to P can be transformed into the “canonical permutation”
P_{m1} ... P_{mn_m} ... P_{21} ... P_{2n_2} P_{11} ... P_{1n_1} by a sequence of admissible interchanges.]

► 5. [M22] Let P be the tableau corresponding to the permutation a_1 a_2 ... a_n; use
exercise 4 to prove that P^T is the tableau corresponding to a_n ... a_2 a_1.




6. [M26] (M. P. Schützenberger.) Let π be an involution with k fixed points. Prove
that the tableau corresponding to π, in the proof of Corollary B, has exactly k columns
of odd length.

7. [M20] (C. Schensted.) Let P be the tableau corresponding to the permutation
a_1 a_2 ... a_n. Prove that the number of columns in P is the length c of the longest
increasing subsequence a_{i_1} < a_{i_2} < ⋯ < a_{i_c}, where i_1 < i_2 < ⋯ < i_c; the number of
rows in P is the length r of the longest decreasing subsequence a_{j_1} > a_{j_2} > ⋯ > a_{j_r},
where j_1 < j_2 < ⋯ < j_r.

8. [M18] (P. Erdős, G. Szekeres.) Prove that any permutation containing more than
n^2 elements has a monotonic subsequence of length greater than n; but there are
permutations of n^2 elements with no monotonic subsequences of length greater than n.
[Hint: See the previous exercise.]

9. [M24] Continuing exercise 8, find a “simple” formula for the exact number of
permutations of {1, 2, ..., n^2} that have no monotonic subsequences of length greater
than n.

10. [M20] Prove that P is a tableau when Algorithm S terminates, if it was a tableau 
initially. 

11. [20] Given only the values of r and s after Algorithm S terminates, is it possible 
to restore P to its original condition? 

12. [M24] How many times is step S3 performed, if Algorithm S is used repeatedly
to delete all elements of a tableau P whose shape is (n_1, n_2, ..., n_m)? What is the
minimum of this quantity, taken over all shapes with n_1 + n_2 + ⋯ + n_m = n?

13. [M28] Prove Theorem C. 

14. [M43] Find a more direct proof of Theorem D, part (c). 

15. [M20] How many permutations of the multiset {l·a, m·b, n·c} have the property
that, as we read the permutation from left to right, the number of c’s never exceeds the
number of b’s, and the number of b’s never exceeds the number of a’s? (For example,
aabcabbcaca is such a permutation.)

16. [M08] In how many ways can the partial ordering represented by (39) be sorted
topologically?

17. [HM25] Let

    g(x_1, x_2, ..., x_n; y) = x_1 Δ(x_1+y, x_2, ..., x_n) + x_2 Δ(x_1, x_2+y, ..., x_n)
                               + ⋯ + x_n Δ(x_1, x_2, ..., x_n+y).

Prove that

    g(x_1, x_2, ..., x_n; y) = (x_1 + x_2 + ⋯ + x_n + (n choose 2) y) Δ(x_1, x_2, ..., x_n).

[Hint: The polynomial g is homogeneous (all terms have the same total degree); and
it is antisymmetric in the x’s (interchanging x_i and x_j changes the sign of g).]

18. [HM30] Generalizing exercise 17, evaluate the sum

    x_1^m Δ(x_1+y, x_2, ..., x_n) + x_2^m Δ(x_1, x_2+y, ..., x_n) + ⋯ + x_n^m Δ(x_1, x_2, ..., x_n+y),

when m ≥ 0.




19. [M40] Find a formula for the number of ways to fill an array that is like a tableau
but with two boxes removed at the left of row 1; for example,

    n_1 - 2 boxes
    n_2 boxes
    n_3 boxes

is such a shape. (The rows and columns are to be in increasing order, as in ordinary
tableaux.)

In other words, how many tableaux of shape (n_1, n_2, ..., n_m) on the elements
{1, 2, ..., n_1 + ⋯ + n_m} have both of the elements 1 and 2 in the first row?

► 20. [M24] Prove that the number of ways to label the nodes of a given tree with
the elements {1, 2, ..., n}, such that the label of each node is less than that of its
descendants, is n! divided by the product of the subtree sizes (the number of nodes in
each subtree). For example, the number of ways to label the nodes of

    [a tree with 11 nodes; figure not reproduced in this scan]

is 11!/(11·4·1·5·1·2·3·1·1·1·1) = 10·9·8·7·6. (Compare with Theorem H.)

21. [HM31] (R. M. Thrall.) Let n_1 > n_2 > ⋯ > n_m specify the shape of a “shifted
tableau” where row i+1 starts one position to the right of row i; for example, a shifted
tableau of shape (7, 5, 4, 1) has the form of the diagram

    12 11  8  7  5  4  1
        9  6  5  3  2
           5  4  2  1
              1

(shown here with the generalized hook length written in each cell). Prove that the
number of ways to put the integers 1, 2, ..., n = n_1 + n_2 + ⋯ + n_m into
shifted tableaux of shape (n_1, n_2, ..., n_m), so that rows and columns are in increasing
order, is n! divided by the product of the “generalized hook lengths”; a generalized
hook of length 11, corresponding to the cell in row 1 column 2, has been shaded in
the diagram of the original. (Hooks in the “inverted staircase” portion of the array, at the left,
have a U-shape, tilted 90°, instead of an L-shape.) Thus there are

    17!/(12·11·8·7·5·4·1·9·6·5·3·2·5·4·2·1·1)

ways to fill the shape with rows and columns in increasing order.

22. [M39] In how many ways can an array of shape (n_1, n_2, ..., n_m) be filled with
elements from the set {1, 2, ..., N} with repetitions allowed, so that the rows are
nondecreasing and the columns are strictly increasing? For example, the simple m-
rowed shape (1, 1, ..., 1) can be filled in (N choose m) ways; the 1-rowed shape (m) can be filled
in (m+N-1 choose m) ways; the small square shape (2, 2) in (1/3)(N+1 choose 2)(N choose 2) ways.

► 23. [HM30] (D. André.) In how many ways, A_n, can the numbers {1, 2, ..., n} be
placed into the array of n cells

    [diagram not reproduced in this scan]

in such a way that the rows and columns are in increasing order? Find the generating
function g(z) = Σ A_n z^n/n!.

24. [M28] Prove that

    Σ_{q_1+⋯+q_n = t, 0≤q_1,...,q_n≤m} ⋯    [the summand and right-hand side are illegible in this copy]

[Hints: Prove that Δ(k_1+n-1, ..., k_n) = Δ(m-k_n+n-1, ..., m-k_1); decompose an
n × (m-n+1) tableau in a fashion analogous to (38); and manipulate the sum as in
the derivation of (36).]

25. [M20] Why is ( 42 ) the generating function for involutions? 

26. [HM21] Evaluate ∫_{-∞}^{∞} x^t exp(-2x^2/√n) dx when t is a nonnegative integer.

27. [M24] Let Q be a Young tableau on {1, 2, ..., n}; let the element i be in row r_i
and column c_i. We say that i is “above” j when r_i < r_j.

a) Prove that, for 1 ≤ i < n, i is above i+1 if and only if c_i ≥ c_{i+1}.

b) Given that Q is such that (P, Q) corresponds to the permutation

    (  1    2   ...   n
      a_1  a_2  ...  a_n ),

prove that i is above i+1 if and only if a_i > a_{i+1}. (Therefore we can determine
the number of runs in the permutation, knowing only Q. This result is due to
M. P. Schützenberger.)

c) Prove that, for 1 ≤ i < n, i is above i+1 in Q if and only if i+1 is above i in Q^S.

28. [M43] Prove that the average length of the longest increasing subsequence of a
random permutation of {1, 2, ..., n} is asymptotically 2√n. (This is the average length
of row 1 in the correspondence of Theorem A.)

29. [HM25] Prove that a random permutation of n elements has an increasing sub-
sequence of length ≥ l with probability ≤ (n choose l)/l!. This probability is O(1/√n) when
l = e√n + O(1), and O(exp(-c√n)) when l = 3√n, c = 6 ln 3 - 6.

30. [M41] (M. P. Schützenberger.) Show that the operation of going from P to P^S is
a special case of an operation applicable in connection with any finite partially ordered
set, not merely a tableau: Label the elements of a partially ordered set with the integers





{1, 2, . . . , n} in such a way that the partial order is consistent with the labeling. Find 
a dual labeling analogous to ( 26 ), by successively deleting the labels 1 , 2 , ... while 
moving the other labels in a fashion analogous to Algorithm S and placing 1, 2, ... 
in the vacated places. Show that this operation, when repeated on the dual labeling 
in reverse numerical order, yields the original labeling; and explore other properties of 
the operation. 

31. [HM30] Let x_n be the number of ways to place n mutually nonattacking rooks on
an n × n chessboard, where each arrangement is unchanged by reflection about both
diagonals. Thus, x_4 = 6. (Involutions are required to be symmetrical about only one
diagonal. Exercise 5.1.3-19 considers a related problem.) Find the asymptotic behavior
of x_n.

32. [HM21] Prove that the involution number t_n is the expected value of X^n, when
X is a normal deviate with mean 1 and variance 1.

33. [M25] (O. H. Mitchell, 1881.) True or false: Δ(a_1, a_2, ..., a_m)/Δ(1, 2, ..., m) is
an integer when a_1, a_2, ..., a_m are integers.

34. [ 25 ] (T. Nakayama, 1940.) Prove that if a tableau shape contains a hook of length 
ab, it contains a hook of length a. 

► 35. [30] (A. P. Hillman and R. M. Grassl, 1976.) An arrangement of nonnegative
integers p_{ij} in a tableau shape is called a plane partition of m if Σ p_{ij} = m and

    p_{i1} ≥ ⋯ ≥ p_{i n_i},    p_{1j} ≥ ⋯ ≥ p_{n′_j j},    for 1 ≤ i ≤ n′_1, 1 ≤ j ≤ n_1,

when there are n_i cells in row i and n′_j cells in column j. It is called a reverse plane
partition if instead

    p_{i1} ≤ ⋯ ≤ p_{i n_i},    p_{1j} ≤ ⋯ ≤ p_{n′_j j},    for 1 ≤ i ≤ n′_1, 1 ≤ j ≤ n_1.

Consider the following algorithm, which operates on reverse plane partitions of a given
shape and constructs another array of numbers q_{ij} having the same shape:

G1. [Initialize.] Set q_{ij} ← 0 for 1 ≤ j ≤ n_i and 1 ≤ i ≤ n′_1. Then set j ← 1.

G2. [Find nonzero cell.] If p_{n′_j j} > 0, set i ← n′_j, k ← j, and go on to step G3.
Otherwise if j < n_1, increase j by 1 and repeat this step. Otherwise stop (the
p array is now zero).

G3. [Decrease p.] Decrease p_{ik} by 1.

G4. [Move up or right.] If i > 1 and p_{(i-1)k} > p_{ik}, decrease i by 1 and return
to G3. Otherwise if k < n_i, increase k by 1 and return to G3.

G5. [Increase q.] Increase q_{ij} by 1 and return to G2.  |

Prove that this construction defines a one-to-one correspondence between reverse plane
partitions of m and solutions of the equation

    m = Σ_{i,j} h_{ij} q_{ij},

where the numbers h_{ij} are the hook lengths of the shape, by designing an algorithm
that recomputes the p’s from the q’s.
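For readers who like to experiment, here is a direct Python transcription of steps
G1-G5 (an illustrative sketch, not part of the exercise; 0-based indices, with the shape
given by its row lengths), together with a check of the claimed identity
m = Σ h_{ij} q_{ij} on one small reverse plane partition:

    def hillman_grassl(p, shape):
        ncol = [sum(1 for ni in shape if ni > j) for j in range(shape[0])]  # n'_j
        q = [[0] * ni for ni in shape]                 # G1: clear q
        j = 0                                          # G1: start at first column
        while True:
            if p[ncol[j] - 1][j] > 0:                  # G2: nonzero at bottom of column j
                i, k = ncol[j] - 1, j
                while True:
                    p[i][k] -= 1                       # G3: decrease p
                    if i > 0 and p[i - 1][k] > p[i][k]:
                        i -= 1                         # G4: move up ...
                    elif k < shape[i] - 1:
                        k += 1                         # ... or right
                    else:
                        break
                q[i][j] += 1                           # G5: increase q
            elif j < shape[0] - 1:
                j += 1
            else:
                return q                               # the p array is now zero

    # Check m = sum of h_ij * q_ij for shape (3, 2), whose hooks are 4,3,1 / 2,1:
    shape = (3, 2)
    hooks = [[4, 3, 1], [2, 1]]
    p = [[0, 1, 2], [1, 2]]                            # a reverse plane partition of m = 6
    q = hillman_grassl([row[:] for row in p], shape)
    assert sum(hooks[i][j] * q[i][j]
               for i in range(len(shape)) for j in range(shape[i])) == 6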

36. [HM27] (R. P. Stanley, 1971.) (a) Prove that the number of reverse plane par-
titions of m in a given shape is [z^m] 1/∏(1 - z^{h_{ij}}), where the numbers h_{ij} are the
hook lengths of the shape. (b) Derive Theorem H from this result. [Hint: What is the
asymptotic number of partitions as m → ∞?]




37. [ M20 ] (P. A. MacMahon, 1912.) What is the generating function for all plane 
partitions? (The coefficient of z m should be the total number of plane partitions of m 
when the tableau shape is unbounded.) 

38. [M30] (Greene, Nijenhuis, and Wilf, 1979.) We can construct a directed acyclic
graph on the cells T of any given tableau shape by letting arcs run from each cell to
the other cells in its hook; the out-degree of cell (i, j) will then be d_{ij} = h_{ij} - 1, where
h_{ij} is the hook length. Suppose we generate a random path in this digraph by choosing
a random starting cell (i, j) and choosing further arcs at random, until coming to a
corner cell from which there is no exit. Each random choice is made uniformly.

a) Let (a, b) be a corner cell of T, and let I = {i_0, ..., i_k} and J = {j_0, ..., j_l} be
sets of rows and columns with i_0 < ⋯ < i_k = a and j_0 < ⋯ < j_l = b. The
digraph contains (k+l choose l) paths whose row and column sets are respectively I and J;
let P(I, J) be the probability that the random path is one of these. Prove that
P(I, J) = 1/(n d_{i_0 b} ⋯ d_{i_{k-1} b} d_{a j_0} ⋯ d_{a j_{l-1}}), where n = |T|.

b) Let f(T) = n!/∏ h_{ij}. Prove that the random path ends at corner (a, b) with
probability f(T \ {(a, b)})/f(T).

c) Show that the result of (b) proves Theorem H and also gives us a way to generate
a random tableau of shape T, with all f(T) tableaux equally likely.


39. [M38] (I. M. Pak and A. V. Stoyanovskii, 1992.) Let P be an array of shape
(n_1, ..., n_m) that has been filled with any permutation of the integers {1, ..., n}, where
n = n_1 + ⋯ + n_m. The following procedure, which is analogous to the “siftup” algorithm
in Section 5.2.3, can be used to convert P to a tableau. It also defines an array Q of
the same shape, which can be used to provide a combinatorial proof of Theorem H.

P1. [Loop on (i, j).] Perform steps P2 and P3 for all cells (i, j) of the array, in
reverse lexicographic order (that is, from bottom to top, and from right to
left in each row); then stop.

P2. [Fix P at (i, j).] Set K ← P_{ij} and perform Algorithm S′ (see below).

P3. [Adjust Q.] Set Q_{ik} ← Q_{i(k+1)} + 1 for j ≤ k < s, and set Q_{is} ← i - r.  |

Here Algorithm S′ is the same as Schützenberger’s Algorithm S, except that steps S1
and S2 are generalized slightly:

S1′. [Initialize.] Set r ← i, s ← j.

S2′. [Done?] If K < P_{(r+1)s} and K < P_{r(s+1)}, set P_{rs} ← K and terminate.

(Algorithm S is essentially the special case i = 1, j = 1, K = ∞.)

For example, Algorithm P straightens out one particular array of shape (3, 3, 2)
in the following way, if we view the contents of arrays P and Q at the beginning of
step P2, with P_{ij} in boldface type:

    [intermediate arrays not reproduced in this scan]






The final result is

    P =  1 3 4        Q =  1 -2 -1
         2 5 8             0 -1  0
         6 7               1  0

a) If P is simply a 1 × n array, Algorithm P sorts it into order. Explain what
the Q array will contain in that case.

b) Answer the same question if P is n × 1 instead of 1 × n.

c) Prove that, in general, we will have

    -b_{ij} ≤ Q_{ij} ≤ r_{ij},

where b_{ij} is the number of cells below (i, j) and r_{ij} is the number of cells to the
right. Thus, the number of possible values for Q_{ij} is exactly h_{ij}, the size of the
(i, j)th hook.

d) Theorem H will be proved constructively if we can show that Algorithm P defines
a one-to-one correspondence between the n! ways to fill the original shape and the
pairs of output arrays (P, Q), where P is a tableau and the elements of Q satisfy
the condition of part (c). Therefore we want to find an inverse of Algorithm P. For
what initial permutations does Algorithm P produce the 2 × 2 array

    Q =  0 -1
         0  0   ?

e) What initial permutation does Algorithm P convert into the arrays

    P =   1  3  5  7 11 15        Q =  -2 -3 -1 -1  1  0
          2  6  8 14                    0 -2 -1  0
          4  9 13                       0 -1  0
         10 12                         -1  0
         16                             0                 ?

f) Design an algorithm that inverts Algorithm P, given any pair of arrays (P, Q)
such that P is a tableau and Q satisfies the condition of part (c). [Hint: Construct
an oriented tree whose vertices are the cells (i, j), with arcs

    (i, j) → (i, j-1)    if P_{i(j-1)} > P_{(i-1)j};
    (i, j) → (i-1, j)    if P_{i(j-1)} ≤ P_{(i-1)j}.

In the example of part (e) we have the tree

    [tree diagram not reproduced in this scan]

The paths of this tree hold the key to inverting Algorithm P.]

40. [HM43] Suppose a random Young tableau has been constructed by successively
placing the numbers 1, 2, ..., n in such a way that each possibility is equally likely
when a new number is placed. For example, the tableau (1) would be obtained with
a probability that is a product of simple fractions, one for each placement, using this
procedure.

Prove that, with high probability, the resulting shape (n_1, n_2, ..., n_m) will have
m ≈ √(6n) and √k + √(n_{k+1}) ≈ √m for 0 ≤ k < m.





41. [25] (Disorder in a library.) Casual users of a library often put books back on the
shelves in the wrong place. One way to measure the amount of disorder present in a
library is to consider the minimum number of times we would have to take a book out
of one place and insert it in another, before all books are restored to the correct order.

Thus let π = a_1 a_2 ... a_n be a permutation of {1, 2, ..., n}. A “deletion-insertion
operation” changes π to

    a_1 ... a_{i-1} a_{i+1} ... a_j a_i a_{j+1} ... a_n    or    a_1 ... a_j a_i a_{j+1} ... a_{i-1} a_{i+1} ... a_n,

for some i and j. Let dis(π) be the minimum number of deletion-insertion operations
that will sort π into order. Can dis(π) be expressed in terms of simpler characteristics
of π?

► 42. [30] (Disorder in a genome.) The DNA of Lobelia fervens has genes occur-
ring in the sequence g_7^R g_1 g_2 g_4 g_5 g_3 g_6^R, where g^R stands for the left-right reflection
of g; the same genes occur in tobacco plants, but in the order g_1 g_2 g_3 g_4 g_5 g_6 g_7. Show
that five “flip” operations on substrings are needed to get from g_1 g_2 g_3 g_4 g_5 g_6 g_7 to
g_7^R g_1 g_2 g_4 g_5 g_3 g_6^R. (A flip takes αβγ to α β^R γ, when α, β, and γ are strings.)

43. [35] Continuing the previous exercise, show that at most n + 1 flips are needed
to sort any rearrangement of g_1 g_2 ... g_n. Construct examples that require n + 1 flips,
for all n ≥ 3.

44. [M37] Show that the average number of flips required to sort a random arrange-
ment of n genes is greater than n - H_n, if all 2^n n! genome rearrangements are equally
likely.




5.2. INTERNAL SORTING 

Let’s BEGIN our discussion of good “sortsmanship” by conducting a little ex- 
periment. How would you solve the following programming problem? 

“Memory locations R+1, R+2, R+3, R+4, and R+5 contain five numbers.
Write a computer program that rearranges these numbers, if necessary,
so that they are in ascending order.”

(If you already are familiar with some sorting methods, please do your best to 
forget about them momentarily; imagine that you are attacking this problem for 
the first time, without any prior knowledge of how to proceed.) 

Before reading any further, you are requested to construct a solution to this 
problem. 


The time you spent working on the challenge problem will pay dividends 
as you continue to read this chapter. Chances are your solution is one of the 
following types: 

A. An insertion sort. The items are considered one at a time, and each new 
item is inserted into the appropriate position relative to the previously-sorted 
items. (This is the way many bridge players sort their hands, picking up one 
card at a time.) 

B. An exchange sort. If two items are found to be out of order, they are 
interchanged. This process is repeated until no more exchanges are necessary. 

C. A selection sort. First the smallest (or perhaps the largest) item is lo- 
cated, and it is somehow separated from the rest; then the next smallest (or next 
largest) is selected, and so on. 

D. An enumeration sort. Each item is compared with each of the others; an 
item’s final position is determined by the number of keys that it exceeds. 

E. A special-purpose sort, which works nicely for sorting five elements as 
stated in the problem, but does not readily generalize to larger numbers of items. 

F. A lazy attitude, with which you ignored the suggestion above and decided 
not to solve the problem at all. Sorry, by now you have read too far and you 
have lost your chance. 

G. A new, super sorting technique that is a definite improvement over known 
methods. (Please communicate this to the author at once.) 

If the problem had been posed for, say, 1000 items, not merely 5, you might 
also have discovered some of the more subtle techniques that will be mentioned 
later. At any rate, when attacking a new problem it is often wise to find some 
fairly obvious procedure that works, and then try to improve upon it. Cases A, B, 
and C above lead to important classes of sorting techniques that are refinements 
of the simple ideas stated. 

Many different sorting algorithms have been invented, and we will be dis- 
cussing about 25 of them in this book. This rather alarming number of methods 
is actually only a fraction of the algorithms that have been devised so far; 
many techniques that are now obsolete will be omitted from our discussion, or 





mentioned only briefly. Why are there so many sorting methods? For computer 
programming, this is a special case of the question, “Why are there so many x 
methods?” , where x ranges over the set of problems; and the answer is that each 
method has its own advantages and disadvantages, so that it outperforms the 
others on some configurations of data and hardware. Unfortunately, there is no 
known “best” way to sort; there are many best methods, depending on what 
is to be sorted on what machine for what purpose. In the words of Rudyard 
Kipling, “There are nine and sixty ways of constructing tribal lays, and every 
single one of them is right.” 

It is a good idea to learn the characteristics of each sorting method, so that 
an intelligent choice can be made for particular applications. Fortunately, it is 
not a formidable task to learn these algorithms, since they are interrelated in 
interesting ways. 

At the beginning of this chapter we defined the basic terminology and 
notation to be used in our study of sorting: The records 

    R_1, R_2, ..., R_N                                        (1)

are supposed to be sorted into nondecreasing order of their keys K_1, K_2, ..., K_N,
essentially by discovering a permutation p(1) p(2) ... p(N) such that

    K_{p(1)} ≤ K_{p(2)} ≤ ⋯ ≤ K_{p(N)}.                       (2)

In the present section we are concerned with internal sorting, when the number 
of records to be sorted is small enough that the entire process can be performed 
in a computer’s high-speed memory. 

In some cases we will want the records to be physically rearranged in memory 
so that their keys are in order, while in other cases it may be sufficient merely 
to have an auxiliary table of some sort that specifies the permutation. If the 
records and/or the keys each take up quite a few words of computer memory, 
it is often better to make up a new table of link addresses that point to the 
records, and to manipulate these link addresses instead of moving the bulky 
records around. This method is called address table sorting (see Fig. 6). If the 
key is short but the satellite information of the records is long, the key may be 
placed with the link addresses for greater speed; this is called keysorting. Other 
sorting schemes utilize an auxiliary link field that is included in each record; 
these links are manipulated in such a way that, in the final result, the records 
are linked together to form a straight linear list, with each link pointing to the 
following record. This is called list sorting (see Fig. 7). 

After sorting with an address table or list method, the records can be re- 
arranged into increasing order as desired. Exercises 10 and 12 discuss interesting 
ways to do this, requiring only enough additional memory space to hold one 
record; alternatively, we can simply move the records into a new area capable 
of holding all records. The latter method is usually about twice as fast as the 
former, but it demands nearly twice as much storage space. Many applications 
can get by without moving the records at all, since the link fields are often 
adequate for all of the subsequent processing. 




Fig. 6. Address table sorting.

Fig. 7. List sorting.

All of the sorting methods that we shall examine in depth will be illustrated 
in four ways, by means of 

a) an English-language description of the algorithm, 

b) a flow diagram, 

c) a MIX program, and 

d) an example of the sorting method applied to a certain set of 16 numbers. 

For convenience, the MIX programs will usually assume that the key is numeric 
and that it fits in a single word; sometimes we will even restrict the key to part 
of a word. The order relation < will be ordinary arithmetic order; and the record 
will consist of the key alone, with no satellite information. These assumptions 
make the programs shorter and easier to understand, and a reader should find 
it fairly easy to adapt any of the programs to the general case by using address 
table sorting or list sorting. An analysis of the running time of each sorting 
algorithm will be given with the MIX programs. 

Sorting by counting. As a simple example of the way in which we shall study 
internal sorting methods, let us consider the “counting” idea mentioned near 
the beginning of this section. This simple method is based on the idea that the 
jth key in the final sorted sequence is greater than exactly j — 1 of the other 
keys. Putting this another way, if we know that a certain key exceeds exactly 
27 others, and if no two keys are equal, the corresponding record should go into 










position 28 after sorting. So the idea is to compare every pair of keys, counting 
how many are less than each particular one. 

The obvious way to do the comparisons is to

    ((compare K_j with K_i) for 1 ≤ j ≤ N) for 1 ≤ i ≤ N;

but it is easy to see that more than half of these comparisons are redundant,
since it is unnecessary to compare a key with itself, and it is unnecessary to
compare K_a with K_b and later to compare K_b with K_a. We need merely to

    ((compare K_j with K_i) for 1 ≤ j < i) for 1 < i ≤ N.

Hence we are led to the following algorithm.

Algorithm C (Comparison counting). This algorithm sorts R_1, ..., R_N on the
keys K_1, ..., K_N by maintaining an auxiliary table COUNT[1], ..., COUNT[N] to
count the number of keys less than a given key. After the conclusion of the
algorithm, COUNT[j] + 1 will specify the final position of record R_j.

C1. [Clear COUNTs.] Set COUNT[1] through COUNT[N] to zero.

C2. [Loop on i.] Perform step C3, for i = N, N-1, ..., 2; then terminate the
algorithm.

C3. [Loop on j.] Perform step C4, for j = i-1, i-2, ..., 1.

C4. [Compare K_i : K_j.] If K_i < K_j, increase COUNT[j] by 1; otherwise increase
COUNT[i] by 1.  |

Note that this algorithm involves no movement of records. It is similar to 
an address table sort, since the COUNT table specifies the final arrangement of 
records; but it is somewhat different because COUNT [j] tells us where to move 
Rj, instead of indicating which record should be moved into the place of Rj. 
(Thus the COUNT table specifies the inverse of the permutation p(1) ... p(N); see
Section 5.1.1.)
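In a higher-level language the algorithm is only a few lines long; the following
Python sketch (an illustration, not the MIX program analyzed below) also shows
how the final COUNT table is used to place each record:

    def comparison_count_sort(keys):
        n = len(keys)
        count = [0] * n                          # C1: clear COUNTs
        for i in range(n - 1, 0, -1):            # C2: i runs downward
            for j in range(i - 1, -1, -1):       # C3: j = i-1 down to the start
                if keys[i] < keys[j]:            # C4: compare Ki : Kj
                    count[j] += 1
                else:
                    count[i] += 1
        out = [None] * n
        for j in range(n):
            out[count[j]] = keys[j]              # COUNT[j] gives Rj's final position
        return out

    keys = [503, 87, 512, 61, 908, 170, 897, 275,
            653, 426, 154, 509, 612, 677, 765, 703]
    assert comparison_count_sort(keys) == sorted(keys)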

Table 1 illustrates the typical behavior of comparison counting, by applying 
it to 16 numbers that were chosen at random by the author on March 19, 1963. 
The same 16 numbers will be used to illustrate almost all of the other methods 
that we shall discuss later. 

In our discussion preceding this algorithm we blithely assumed that no two 
keys were equal. This was a potentially dangerous assumption, for if equal 
keys corresponded to equal COUNTS the final rearrangement of records would be 
quite complicated. Fortunately, however, Algorithm C gives the correct result 
no matter how many equal keys are present; see exercise 2. 

Program C (Comparison counting). The following MIX implementation of
Algorithm C assumes that R_j is stored in location INPUT + j, and COUNT[j]
in location COUNT + j, for 1 ≤ j ≤ N; rI1 ≡ i; rI2 ≡ j; rA ≡ K_i ≡ R_i;
rX ≡ COUNT[i].

    01  START  ENT1  N          1     C1. Clear COUNTs.
    02         STZ   COUNT,1    N     COUNT[i] <- 0.
    03         DEC1  1          N
    04         J1P   *-2        N     N >= i > 0.




Table 1

SORTING BY COUNTING (ALGORITHM C)

    KEYS:              503 087 512 061 908 170 897 275 653 426 154 509 612 677 765 703
    COUNT (init.):       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    COUNT (i = N):       0   0   0   0   1   0   1   0   0   0   0   0   0   0   1  12
    COUNT (i = N-1):     0   0   0   0   2   0   2   0   0   0   0   0   0   0  13  12
    COUNT (i = N-2):     0   0   0   0   3   0   3   0   0   0   0   0   0  11  13  12
    COUNT (i = N-3):     0   0   0   0   4   0   4   0   1   0   0   0   9  11  13  12
    COUNT (i = N-4):     0   0   1   0   5   0   5   0   2   0   0   7   9  11  13  12
    COUNT (i = N-5):     1   0   2   0   6   1   6   1   3   1   2   7   9  11  13  12
        ⋮
    COUNT (i = 2):       6   1   8   0  15   3  14   4  10   5   2   7   9  11  13  12



Fig. 8. Algorithm C: Comparison counting. 


    05         ENT1  N          1     C2. Loop on i.
    06         JMP   1F         1
    07  2H     LDA   INPUT,1    N-1
    08         LDX   COUNT,1    N-1
    09  3H     CMPA  INPUT,2    A     C4. Compare Ki : Kj.
    10         JGE   4F         A     Jump if Ki >= Kj.
    11         LD3   COUNT,2    B     COUNT[j]
    12         INC3  1          B         + 1
    13         ST3   COUNT,2    B         -> COUNT[j].
    14         JMP   5F         B
    15  4H     INCX  1          A-B   COUNT[i] <- COUNT[i] + 1.
    16  5H     DEC2  1          A     C3. Loop on j.
    17         J2P   3B         A
    18         STX   COUNT,1    N-1
    19         DEC1  1          N-1
    20  1H     ENT2  -1,1       N     N >= i > j > 0.
    21         J2P   2B         N     |

The running time of this program is 13N + 6A + 5B - 4 units, where N is
the number of records; A is the number of choices of two things from a set of
N objects, namely (N choose 2) = (N^2 - N)/2; and B is the number of pairs of indices
for which j < i and K_j > K_i. Thus, B is the number of inversions of the
permutation K_1 ... K_N; this is the quantity that was analyzed extensively in
Section 5.1.1, where we found in Eqs. 5.1.1-(12) and 5.1.1-(13) that, for unequal
keys in random order, we have

    B = (min 0, ave (N^2-N)/4, max (N^2-N)/2, dev √(N(N-1)(N+2.5))/6).







Hence Program C requires between 3N^2 + 10N - 4 and 5.5N^2 + 7.5N - 4 units
of time, and the average running time lies halfway between these two extremes.
For example, the data in Table 1 has N = 16, A = 120, B = 41, so Program C
will sort it in 1129u. See exercise 5 for a modification of Program C that has
slightly different timing characteristics.

The factor N^2 that dominates this running time shows that Algorithm C
is not an efficient way to sort when N is large; doubling the number of records
increases the running time fourfold. Since the method requires a comparison of
all distinct pairs of keys (K_i, K_j), there is no apparent way to get rid of the
dependence on N^2, although we will see later in this chapter that the worst-case
running time for sorting can be reduced to order N log N using other techniques.
Our main interest in Algorithm C is its simplicity, not its speed. Algorithm C
serves as an example of the style in which we will be describing more complex
(and more efficient) methods.

There is another way to sort by counting that is quite important from the
standpoint of efficiency; it is primarily applicable in the case that many equal
keys are present, and when all keys fall into the range u ≤ K_j ≤ v, where (v - u)
is small. These assumptions appear to be quite restrictive, but in fact we shall
see quite a few applications of the idea. For example, if we apply this method
to the leading digits of keys instead of applying it to entire keys, the file will be
partially sorted and it will be comparatively simple to complete the job.

In order to understand the principles involved, suppose that all keys lie
between 1 and 100. In one pass through the file we can count how many 1s, 2s,
..., 100s are present; and in a second pass we can move the records into the
appropriate place in an output area. The following algorithm spells things out
in complete detail:

Algorithm D (Distribution counting). Assuming that all keys are integers in
the range u ≤ K_j ≤ v for 1 ≤ j ≤ N, this algorithm sorts the records R_1, ..., R_N
by making use of an auxiliary table COUNT[u], ..., COUNT[v]. At the conclusion
of the algorithm the records are moved to an output area S_1, ..., S_N in the
desired order.

D1. [Clear COUNTs.] Set COUNT[u] through COUNT[v] all to zero.

D2. [Loop on j.] Perform step D3 for 1 ≤ j ≤ N; then go to step D4.

D3. [Increase COUNT[K_j].] Increase the value of COUNT[K_j] by 1.

D4. [Accumulate.] (At this point COUNT[i] is the number of keys that are equal
to i.) Set COUNT[i] ← COUNT[i] + COUNT[i-1], for i = u+1, u+2, ..., v.

D5. [Loop on j.] (At this point COUNT[i] is the number of keys that are less than
or equal to i; in particular, COUNT[v] = N.) Perform step D6 for j = N,
N-1, ..., 1; then terminate the algorithm.

D6. [Output R_j.] Set i ← COUNT[K_j], S_i ← R_j, and COUNT[K_j] ← i - 1.  |

An example of this algorithm is worked out in exercise 6; a MIX program appears 
in exercise 9. When the range v — u is small, this sorting procedure is very fast. 
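The same steps are easy to express in a higher-level language; this Python sketch
(illustrative only; the key-extracting function is a convenience of the sketch, not part
of Algorithm D as stated) also shows how the right-to-left loop of step D5 preserves
the order of equal keys:

    def distribution_count_sort(records, key, u, v):
        count = [0] * (v - u + 1)                 # D1: clear COUNTs
        for r in records:                         # D2-D3: count each key
            count[key(r) - u] += 1
        for i in range(1, v - u + 1):             # D4: accumulate
            count[i] += count[i - 1]              # now count[i] = #{keys <= u+i}
        out = [None] * len(records)
        for r in reversed(records):               # D5-D6: output, right to left
            count[key(r) - u] -= 1
            out[count[key(r) - u]] = r
        return out

    data = ['5T', '0C', '5U', '0O', '9.', '1N', '8S', '2R',
            '6A', '4A', '1G', '5L', '6T', '6I', '7O', '7N']
    out = distribution_count_sort(data, lambda r: int(r[0]), 0, 9)
    print(''.join(r[1] for r in out))   # -> CONGRATULATIONS.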





Fig. 9. Algorithm D: Distribution counting. 

Sorting by comparison counting as in Algorithm C was first mentioned in 
print by E. H. Friend [JACM 3 (1956), 152], although he didn’t claim it as his 
own invention. Distribution sorting as in Algorithm D was first developed by 
H. Seward in 1954 for use with radix sorting techniques that we will discuss 
later (see Section 5.2.5); it was also published under the name “Mathsort” by 
W. Feurzeig, CACM 3 (1960), 601. 

EXERCISES 

1. [15] Would Algorithm C still work if i varies from 2 up to N in step C2, instead 
of from N down to 2? What if j varies from 1 up to i - 1 in step C3? 

2. [21] Show that Algorithm C works properly when equal keys are present. If
K_j = K_i and j < i, does R_j come before or after R_i in the final ordering?

► 3. [21] Would Algorithm C still work properly if the test in step C4 were changed
from “K_i < K_j” to “K_i ≤ K_j”?

4. [16] Write a MIX program that “finishes” the sorting begun by Program C; your
program should transfer the keys to locations OUTPUT+1 through OUTPUT+N, in ascending
order. How much time does your program require?

5. [22] Does the following set of changes improve Program C? 

New line 08a: INCX 0,2 
Change line 10: JGE 5F 
Change line 14: DECX 1 
Delete line 15. 

6. [18] Simulate Algorithm D by hand, showing intermediate results when the 16
records 5T, 0C, 5U, 0O, 9., 1N, 8S, 2R, 6A, 4A, 1G, 5L, 6T, 6I, 7O, 7N are being sorted.
Here the numeric digit is the key, and the alphabetic information is just carried along
with the records.

7. [13] Is Algorithm D a stable sorting method? 

8. [15] Would Algorithm D still work properly if j were to vary from 1 up to N in 
step D5, instead of from N down to 1? 

9. [23] Write a MIX program for Algorithm D, analogous to Program C and exercise 4.
What is the execution time of your program, as a function of N and (v - u)?

10. [25] Design an efficient algorithm that replaces the N quantities (R_1, ..., R_N) by
(R_{p(1)}, ..., R_{p(N)}), respectively, given the values of R_1, ..., R_N and the permutation





p(1) ... p(N) of {1, ..., N}. Try to avoid using excess memory space. (This problem
arises if we wish to rearrange records in memory after an address table sort, without
having enough room to store 2N records.)

11. [M27] Write a MIX program for the algorithm of exercise 10, and analyze its 
efficiency. 

► 12. [25] Design an efficient algorithm suitable for rearranging the records R_1, ..., R_N
into sorted order, after a list sort (Fig. 7) has been completed. Try to avoid using
excess memory space.

► 13. [27] Algorithm D requires space for 2N records R_1, ..., R_N and S_1, ..., S_N. Show
that it is possible to get by with only N records R_1, ..., R_N, if a new unshuffling
procedure is substituted for steps D5 and D6. (Thus the problem is to design an
algorithm that rearranges R_1, ..., R_N in place, based on the values of COUNT[u], ...,
COUNT[v] after step D4, without using additional memory space; this is essentially a
generalization of the problem considered in exercise 10.)

5.2.1. Sorting by Insertion

One of the important families of sorting techniques is based on the “bridge
player” method mentioned near the beginning of Section 5.2: Before examining
record R_j, we assume that the preceding records R_1, ..., R_{j-1} have already
been sorted; then we insert R_j into its proper place among the previously sorted
records. Several interesting variations on this basic theme are possible.

Straight insertion. The simplest insertion sort is the most obvious one.
Assume that 1 < j ≤ N and that records R_1, ..., R_{j-1} have been rearranged so
that

    K_1 ≤ K_2 ≤ ⋯ ≤ K_{j-1}.

(Remember that, throughout this chapter, K_j denotes the key portion of R_j.)
We compare the new key K_j with K_{j-1}, K_{j-2}, ..., in turn, until discovering
that R_j should be inserted between records R_i and R_{i+1}; then we move records
R_{i+1}, ..., R_{j-1} up one space and put the new record into position i+1. It is
convenient to combine the comparison and moving operations, interleaving them
as shown in the following algorithm; since R_j “settles to its proper level” this
method of sorting has often been called the sifting or sinking technique.



Fig. 10. Algorithm S: Straight insertion. 


Algorithm S (Straight insertion sort). Records R_1, ..., R_N are rearranged in
place; after sorting is complete, their keys will be in order, K_1 ≤ ⋯ ≤ K_N.








S1. [Loop on j.] Perform steps S2 through S5 for j = 2, 3, ..., N; then terminate
the algorithm.

S2. [Set up i, K, R.] Set i ← j - 1, K ← K_j, R ← R_j. (In the following steps
we will attempt to insert R into the correct position, by comparing K with
K_i for decreasing values of i.)

S3. [Compare K : K_i.] If K ≥ K_i, go to step S5. (We have found the desired
position for record R.)

S4. [Move R_i, decrease i.] Set R_{i+1} ← R_i, then i ← i - 1. If i > 0, go back to
step S3. (If i = 0, K is the smallest key found so far, so record R belongs in
position 1.)

S5. [R into R_{i+1}.] Set R_{i+1} ← R.  |
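In a higher-level language, Algorithm S collapses to just a few lines; the following
Python sketch (an illustration, not the MIX program below) sorts the sixteen example
keys:

    def straight_insertion_sort(r):
        for j in range(1, len(r)):        # S1: j = 2, ..., N in the book's terms
            key = r[j]                    # S2: K <- Kj, R <- Rj
            i = j - 1
            while i >= 0 and key < r[i]:  # S3: stop as soon as K >= Ki
                r[i + 1] = r[i]           # S4: move Ri up one space
                i -= 1
            r[i + 1] = key                # S5: R into R(i+1)
        return r

    keys = [503, 87, 512, 61, 908, 170, 897, 275,
            653, 426, 154, 509, 612, 677, 765, 703]
    assert straight_insertion_sort(keys) == sorted(keys)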

Table 1 shows how our sixteen example numbers are sorted by Algorithm S. This 

method is extremely easy to implement on a computer; in fact the following MIX 

program is the shortest decent sorting routine in this book. 

Table 1

EXAMPLE OF STRAIGHT INSERTION

    503 : 087
    087 503 : 512
    087 503 512 : 061
    061 087 503 512 : 908
    061 087 503 512 908 : 170
    061 087 170 503 512 908 : 897
        ⋮
    061 087 154 170 275 426 503 509 512 612 653 677 765 897 908 : 703
    061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908


Program S (Straight insertion sort). The records to be sorted are in locations
INPUT+1 through INPUT+N; they are sorted in place in the same area, on a full-
word key. rI1 ≡ j - N; rI2 ≡ i; rA ≡ R ≡ K; assume that N ≥ 2.

    01  START  ENT1  2-N        1        S1. Loop on j. j <- 2.
    02  2H     LDA   INPUT+N,1  N-1      S2. Set up i, K, R.
    03         ENT2  N-1,1      N-1      i <- j - 1.
    04  3H     CMPA  INPUT,2    B+N-1-A  S3. Compare K : Ki.
    05         JGE   5F         B+N-1-A  To S5 if K >= Ki.
    06  4H     LDX   INPUT,2    B        S4. Move Ri, decrease i.
    07         STX   INPUT+1,2  B        R(i+1) <- Ri.
    08         DEC2  1          B        i <- i - 1.
    09         J2P   3B         B        To S3 if i > 0.
    10  5H     STA   INPUT+1,2  N-1      S5. R into R(i+1).
    11         INC1  1          N-1
    12         J1NP  2B         N-1      2 <= j <= N.  |




The running time of this program is 9B + 10N - 3A - 9 units, where N is
the number of records sorted, A is the number of times i decreases to zero in
step S4, and B is the number of moves. Clearly A is the number of times
K_j < min(K_1, ..., K_{j-1}) for 1 < j ≤ N; this is one less than the number of left-
to-right minima, so A is equivalent to the quantity that was analyzed carefully
in Section 1.2.10. Some reflection shows us that B is also a familiar quantity:
The number of moves for fixed j is the number of inversions of K_j, so B is
the total number of inversions of the permutation K_1 K_2 ... K_N. Hence by Eqs.
1.2.10-(16), 5.1.1-(12), and 5.1.1-(13), we have

    A = (min 0, ave H_N - 1, max N - 1, dev √(H_N - H_N^{(2)}));
    B = (min 0, ave (N^2-N)/4, max (N^2-N)/2, dev √(N(N-1)(N+2.5))/6);

and the average running time of Program S, assuming that the input keys are
distinct and randomly ordered, is (2.25N^2 + 7.75N - 3H_N - 6)u. Exercise 33
explains how to improve this slightly.

The example data in Table 1 involves 16 items; there are two changes to the
left-to-right minimum, namely 087 and 061; and there are 41 inversions, as we
have seen in the previous section. Hence N = 16, A = 2, B = 41, and the total
sorting time is 514u.

Binary insertion and two-way insertion. While the jth record is being
processed during a straight insertion sort, we compare its key with about j/2
of the previously sorted keys, on the average; therefore the total number of
comparisons performed comes to roughly (1 + 2 + ⋯ + N)/2 ≈ N^2/4, and this
gets very large when N is only moderately large. In Section 6.2.1 we shall
study “binary search” techniques, which show where to insert the jth item
after only about lg j well-chosen comparisons have been made. For example,
when inserting the 64th record we can start by comparing K_64 with K_32; if it
is less, we compare it with K_16, but if it is greater we compare it with K_48,
etc., so that the proper place to insert R_64 will be known after making only six
comparisons. The total number of comparisons for inserting all N items comes
to about N lg N, a substantial improvement over (1/4)N^2; and Section 6.2.1 shows
that the corresponding program need not be much more complicated than a
program for straight insertion. This method is called binary insertion; it was
mentioned by John Mauchly as early as 1946, in the first published discussion
of computer sorting.
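A sketch of binary insertion in Python (illustrative, not from the book; the library
routine bisect_right supplies the roughly lg j comparisons) makes the remaining
difficulty discussed next quite plain: the block move still costs about j/2 operations
per insertion.

    from bisect import bisect_right

    def binary_insertion_sort(r):
        for j in range(1, len(r)):
            key = r[j]
            pos = bisect_right(r, key, 0, j)   # binary search among r[0..j-1]
            r[pos + 1:j + 1] = r[pos:j]        # the O(j) move that still dominates
            r[pos] = key
        return r

    keys = [503, 87, 512, 61, 908, 170, 897, 275,
            653, 426, 154, 509, 612, 677, 765, 703]
    assert binary_insertion_sort(keys) == sorted(keys)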

The unfortunate difficulty with binary insertion is that it solves only half
of the problem; after we have found where record R_j is to be inserted, we still
need to move about (1/2)j of the previously sorted records in order to make room
for R_j, so the total running time is still essentially proportional to N^2. Some
early computers such as the IBM 705 had a built-in “tumble” instruction that did
such move operations at high speed, and modern machines can do the moves even
faster with special hardware attachments; but as N increases, the dependence
on N^2 eventually takes over. For example, an analysis by H. Nagler [CACM 3




(1960), 618-620] indicated that binary insertion could not be recommended for 
sorting more than about N = 128 records on the IBM 705, when each record 
was 80 characters long, and similar analyses apply to other machines. 

Of course, a clever programmer can think of various ways to reduce the
amount of moving that is necessary; the first such trick, proposed early in the
1950s, is illustrated in Table 2. Here the first item is placed in the center of an
output area, and space is made for subsequent items by moving to the right or
to the left, whichever is most convenient. This saves about half the running time
of ordinary binary insertion, at the expense of a somewhat more complicated
program. It is possible to use this method without using up more space than
required for N records (see exercise 6); but we shall not dwell any longer on this
“two-way” method of insertion, since considerably more interesting techniques
have been developed.


Table 2 


TWO-WAY INSERTION 




A 

503 






087 

503 






087 

A 

503 

512 




061 

087 

503 

512 




061 

087 

503 

512 

908 


061 

087 

170 

503 

512 

908 


061 

087 

170 A 

503 

512 

897 

908 

061 087 

170 

275 

503 

512 

897 

908 


Shell’s method. If we have a sorting algorithm that moves items only one
position at a time, its average time will be, at best, proportional to N^2, since
each record must travel an average of about N/3 positions during the sorting
process (see exercise 7). Therefore, if we want to make substantial improvements
over straight insertion, we need some mechanism by which the records can take
long leaps instead of short steps.

Such a method was proposed in 1959 by Donald L. Shell [CACM 2,7
(July 1959), 30-32], and it became known as shellsort. Table 3 illustrates the
general idea behind the method: First we divide the 16 records into 8 groups
of two each, namely (R_1, R_9), (R_2, R_10), ..., (R_8, R_16). Sorting each group of
records separately takes us to the second line of Table 3; this is called the “first
pass.” Notice that 154 has changed places with 512; 908 and 897 have both
jumped to the right. Now we divide the records into 4 groups of four each,
namely (R_1, R_5, R_9, R_13), ..., (R_4, R_8, R_12, R_16), and again each group is sorted
separately; this “second pass” takes us to line 3. A third pass sorts two groups
of eight records, then a fourth pass completes the job by sorting all 16 records.
Each of the intermediate sorting processes involves either a comparatively short
file or a file that is comparatively well ordered, so straight insertion can be used




Table 3

SHELLSORT WITH INCREMENTS 8, 4, 2, 1

            503 087 512 061 908 170 897 275 653 426 154 509 612 677 765 703
    8-sort: 503 087 154 061 612 170 765 275 653 426 512 509 908 677 897 703
    4-sort: 503 087 154 061 612 170 512 275 653 426 765 509 908 677 897 703
    2-sort: 154 061 503 087 512 170 612 275 653 426 765 509 897 677 908 703
    1-sort: 061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908


for each sorting operation. In this way the records tend to converge quickly to
their final destinations.

Shellsort is also known as the “diminishing increment sort,” since each pass
is defined by an increment h such that we sort the records that are h units apart.
The sequence of increments 8, 4, 2, 1 is not sacred; indeed, any sequence h_{t-1},
h_{t-2}, ..., h_0 can be used, so long as the last increment h_0 equals 1. For example,
Table 4 shows the same data sorted with increments 7, 5, 3, 1. Some sequences
are much better than others; we will discuss the choice of increments later.

Algorithm D (Shellsort). Records R_1, ..., R_N are rearranged in place; after
sorting is complete, their keys will be in order, K_1 ≤ ··· ≤ K_N. An auxiliary
sequence of increments h_{t-1}, h_{t-2}, ..., h_0 is used to control the sorting process,
where h_0 = 1; proper choice of these increments can significantly decrease the
sorting time. This algorithm reduces to Algorithm S when t = 1.

D1. [Loop on s.] Perform step D2 for s = t-1, t-2, ..., 0; then terminate the
    algorithm.

D2. [Loop on j.] Set h ← h_s, and perform steps D3 through D6 for h < j ≤ N.
    (We will use a straight insertion method to sort elements that are h positions
    apart, so that K_i ≤ K_{i+h} for 1 ≤ i ≤ N - h. Steps D3 through D6 are
    essentially the same as steps S2 through S5, respectively, in Algorithm S.)

D3. [Set up i, K, R.] Set i ← j - h, K ← K_j, R ← R_j.

D4. [Compare K : K_i.] If K ≥ K_i, go to step D6.

D5. [Move R_i, decrease i.] Set R_{i+h} ← R_i, then i ← i - h. If i > 0, go back to
    step D4.

D6. [R into R_{i+h}.] Set R_{i+h} ← R. ∎
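For concreteness, here is a rendering of Algorithm D in C, a minimal sketch of
ours rather than Knuth's program, using zero-based arrays instead of the book's
one-based records:

    #include <stddef.h>

    /* Algorithm D (shellsort) in C. incs[0..t-1] holds h_{t-1}, ..., h_0
       and must end with 1; a[0..n-1] is sorted in place. */
    void shellsort(int a[], size_t n, const size_t incs[], size_t t)
    {
        for (size_t s = 0; s < t; s++) {         /* D1: loop on s            */
            size_t h = incs[s];                  /* D2: h <- h_s             */
            for (size_t j = h; j < n; j++) {     /* straight insertion,      */
                int k = a[j];                    /*   stride h (D3)          */
                size_t i = j;
                while (i >= h && a[i - h] > k) { /* D4/D5: move records that */
                    a[i] = a[i - h];             /*   are h positions apart  */
                    i -= h;
                }
                a[i] = k;                        /* D6: R into its place     */
            }
        }
    }

A call such as shellsort(a, 16, (const size_t[]){8, 4, 2, 1}, 4) reproduces the
passes of Table 3.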

The corresponding MIX program is not much longer than our program for
straight insertion. Lines 08-19 of the following code are a direct translation of
Program S into the more general framework of Algorithm D.

Program D (Shellsort). We assume that the increments are stored in an
auxiliary table, with h_s in location H + s; all increments are less than N. Register




Table 4
SHELLSORT WITH INCREMENTS 7, 5, 3, 1

             503 087 512 061 908 170 897 275 653 426 154 509 612 677 765 703
    7-sort:  275 087 426 061 509 170 677 503 653 512 154 908 612 897 765 703
    5-sort:  154 087 426 061 509 170 677 503 653 512 275 908 612 897 765 703
    3-sort:  061 087 170 154 275 426 512 503 653 612 509 765 677 897 908 703
    1-sort:  061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908


assignments: rI1 ≡ j - N; rI2 ≡ i; rA ≡ R ≡ K; rI3 ≡ s; rI4 ≡ h. Note that this
program modifies itself, in order to obtain efficient execution of the inner loop.


01  START  ENT3  T-1        1         D1. Loop on s. s ← t - 1.
02  1H     LD4   H,3        T         D2. Loop on j. h ← h_s.
03         ENT1  INPUT,4    T         Modify the addresses of three
04         ST1   5F(0:2)    T           instructions in the main loop.
05         ST1   6F(0:2)    T
06         ENN1  -N,4       T         rI1 ← N - h.
07         ST1   3F(0:2)    T
08         ENT1  1-N,4      T         j ← h + 1.
09  2H     LDA   INPUT+N,1  NT-S      D3. Set up i, K, R.
10  3H     ENT2  N-H,1      NT-S      i ← j - h. [Instruction modified]
11  4H     CMPA  INPUT,2    B+NT-S-A  D4. Compare K : K_i.
12         JGE   6F         B+NT-S-A  To D6 if K ≥ K_i.
13         LDX   INPUT,2    B         D5. Move R_i, decrease i.
14  5H     STX   INPUT+H,2  B         R_{i+h} ← R_i. [Instruction modified]
15         DEC2  0,4        B         i ← i - h.
16         J2P   4B         B         To D4 if i > 0.
17  6H     STA   INPUT+H,2  NT-S      D6. R into R_{i+h}. [Instruction modified]
18  7H     INC1  1          NT-S      j ← j + 1.
19         J1NP  2B         NT-S      To D3 if j ≤ N.
20         DEC3  1          T
21         J3NN  1B         T         t > s ≥ 0. ∎


*Analysis of shellsort. In order to choose a good sequence of increments
h_{t-1}, ..., h_0 for use in Algorithm D, we need to analyze the running time as
a function of those increments. This leads to some fascinating mathematical
problems, not yet completely resolved; nobody has been able to determine
the best possible sequence of increments for large values of N. Yet a good
many interesting facts are known about the behavior of shellsort, and we will
summarize them here; details appear in the exercises below. [Readers who are
not mathematically inclined should skim over the next few pages, continuing
with the discussion of list insertion following (12).]




The frequency counts shown with Program D indicate that five factors
determine the execution time: the size of the file, N; the number of passes
(that is, the number of increments), T = t; the sum of the increments,

    S = h_0 + ··· + h_{t-1};

the number of comparisons, B + NT - S - A; and the number of moves, B. As
in the analysis of Program S, A is essentially the number of left-to-right minima
encountered in the intermediate sorting operations, and B is the number of
inversions in the subfiles. The factor that governs the running time is B, so we
shall devote most of our attention to it. For purposes of analysis we shall assume
that the keys are distinct and initially in random order.

Let us call the operation of step D2 "h-sorting," so that shellsort consists
of h_{t-1}-sorting, followed by h_{t-2}-sorting, ..., followed by h_0-sorting. A file in
which K_i ≤ K_{i+h} for 1 ≤ i ≤ N - h will be called "h-ordered."

Consider first the simplest generalization of straight insertion, when there
are just two increments, h_1 = 2 and h_0 = 1. In this case the second pass begins
with a 2-ordered sequence of keys, K_1 K_2 ... K_N. It is easy to see that the number
of permutations a_1 a_2 ... a_n of {1, 2, ..., n} having a_i < a_{i+2} for 1 ≤ i ≤ n - 2 is

    \binom{n}{\lfloor n/2 \rfloor},

since we obtain exactly one 2-ordered permutation for each choice of ⌊n/2⌋
elements to put in the even-numbered positions a_2 a_4 ..., while the remaining
⌈n/2⌉ elements occupy the odd-numbered positions. Each 2-ordered permutation
is equally likely after a random file has been 2-sorted. What is the average
number of inversions among all such permutations?

Let A_n be the total number of inversions among all 2-ordered permutations
of {1, 2, ..., n}. Clearly A_1 = 0, A_2 = 1, A_3 = 2; and by considering the six
cases

    1324  1234  1243  2134  2143  3142

we find that A_4 = 1 + 0 + 1 + 1 + 2 + 3 = 8. One way to investigate A_n in
general is to consider the "lattice diagram" illustrated in Fig. 11 for n = 15.
A 2-ordered permutation of {1, 2, ..., n} can be represented as a path from the
upper left corner point (0, 0) to the lower right corner point (⌈n/2⌉, ⌊n/2⌋), if
we make the kth step of the path go downwards or to the right, respectively,
according as k appears in an odd or an even position in the permutation. This
rule defines a one-to-one correspondence between 2-ordered permutations and
n-step paths from corner to corner of the lattice diagram; for example, the path
shown by the heavy line in Fig. 11 corresponds to the permutation

    2 1 3 4 6 5 7 10 8 11 9 12 14 13 15.    (1)

Furthermore, we can attach "weights" to the vertical lines of the path, as Fig. 11
shows; a line from (i, j) to (i+1, j) gets weight |i - j|. A little study will convince
the reader that the sum of these weights along each path is equal to the number
of inversions of the corresponding permutation; this sum also equals the number





Fig. 11. Correspondence between 2-ordering and paths in a lattice. Italicized numbers
are weights that yield the number of inversions in the 2-ordered permutation.


of shaded squares between the given path and the staircase path indicated by
heavy dots in the figure. (See exercise 12.) Thus, for example, (1) has 1 + 0 +
1 + 0 + 1 + 2 + 1 + 0 = 6 inversions.

When a ≤ a′ and b ≤ b′, the number of relevant paths from (a, b) to (a′, b′)
is the number of ways to mix a′ - a vertical lines with b′ - b horizontal lines,
namely

    \binom{a′ - a + b′ - b}{a′ - a};

hence the number of permutations whose corresponding path traverses the ver-
tical line segment from (i, j) to (i+1, j) is

    \binom{i + j}{i} \binom{n - i - j - 1}{\lfloor n/2 \rfloor - j}.

Multiplying by the associated weight and summing over all segments gives

    A_n = \sum_{i,j} |i - j| \binom{i + j}{i} \binom{n - i - j - 1}{\lfloor n/2 \rfloor - j}.    (2)

The absolute value signs in these sums make the calculations somewhat tricky,
but exercise 14 shows that A_n has the surprisingly simple form ⌊n/2⌋ 2^{n-2}. Hence






the average number of inversions in a random 2-ordered permutation is

    \frac{\lfloor n/2 \rfloor 2^{n-2}}{\binom{n}{\lfloor n/2 \rfloor}};    (3)

by Stirling's approximation this is asymptotically \sqrt{\pi/128}\, n^{3/2} ≈ 0.15 n^{3/2}. The
maximum number of inversions is easily seen to be

    \binom{\lfloor n/2 \rfloor + 1}{2} ≈ \frac{1}{8} n^2.

It is instructive to study the distribution of inversions more carefully, by
examining the generating functions

    h_1(z) = 1,
    h_2(z) = 1 + z,
    h_3(z) = 1 + 2z,
    h_4(z) = 1 + 3z + z^2 + z^3,  ...,

as in exercise 15. In this way we find that the standard deviation is also
proportional to n^{3/2}, so the distribution is not extremely stable about the mean.

Now let us consider the general two-pass case of Algorithm D, when the
increments are h and 1:


Theorem H. The average number of inversions in an h-ordered permutation
of {1, 2, ..., n} is

    f(n, h) = \frac{2^{2q-1} q!\, q!}{(2q+1)!} \left( \binom{r}{2} (q+1)^2 + r(h-r)(q+1)q + \binom{h-r}{2}\, q\Bigl(q + \tfrac{1}{2}\Bigr) \right),    (4)

where q = ⌊n/h⌋ and r = n mod h.

This theorem is due to Douglas H. Hunt [Bachelor's thesis, Princeton University
(April 1967)]. Note that when h ≥ n the formula correctly gives f(n, h) = \tfrac{1}{2}\binom{n}{2}.


Proof. An h-ordered permutation contains r sorted subsequences of length q+ 1, 
and h - r of length q. Each inversion comes from a pair of distinct subsequences, 
and a given pair of distinct subsequences in a random h-ordered permutation 
defines a random 2-ordered permutation. The average number of inversions 
is therefore the sum of the average number of inversions between each pair of 
distinct subsequences, namely 



    \binom{r}{2} \frac{A_{2q+2}}{\binom{2q+2}{q+1}} + r(h-r) \frac{A_{2q+1}}{\binom{2q+1}{q}} + \binom{h-r}{2} \frac{A_{2q}}{\binom{2q}{q}} = f(n, h). ∎
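Formula (4) lends itself to a brute-force check. The following C sketch, our
verification aid and not part of the text, enumerates all permutations of
{1, ..., n}, keeps the h-ordered ones, and compares their average inversion count
against (4); the coefficient 2^{2q-1} q! q!/(2q+1)! is accumulated as a product of
the ratios 4i²/((2i)(2i+1)):

    #include <stdio.h>

    static int n, h;
    static long count, totinv;

    static void visit(const int a[])            /* tally one h-ordered perm */
    {
        for (int i = 0; i + h < n; i++)
            if (a[i] > a[i + h]) return;        /* not h-ordered: skip      */
        long inv = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (a[i] > a[j]) inv++;
        count++; totinv += inv;
    }

    static void perms(int a[], int k)           /* enumerate permutations   */
    {
        if (k == n) { visit(a); return; }
        for (int i = k; i < n; i++) {
            int t = a[k]; a[k] = a[i]; a[i] = t;
            perms(a, k + 1);
            t = a[k]; a[k] = a[i]; a[i] = t;
        }
    }

    int main(void)
    {
        n = 7; h = 3;                           /* so q = 2 and r = 1       */
        int a[16];
        for (int i = 0; i < n; i++) a[i] = i + 1;
        perms(a, 0);

        int q = n / h, r = n % h;
        double c = 0.5;                         /* 2^{2q-1} q! q!/(2q+1)!   */
        for (int i = 1; i <= q; i++)
            c *= 4.0 * i * i / ((2 * i) * (2 * i + 1));
        double f = c * (r * (r - 1) / 2.0 * (q + 1) * (q + 1)
                      + (double)r * (h - r) * (q + 1) * q
                      + (h - r) * (h - r - 1) / 2.0 * q * (q + 0.5));
        printf("brute force %.6f, formula %.6f\n",
               (double)totinv / count, f);      /* both print 4.533333      */
        return 0;
    }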


Corollary H. If the sequence of increments h_{t-1}, ..., h_1, h_0 satisfies the
condition

    h_{s+1} mod h_s = 0,   for t - 1 > s ≥ 0,    (5)





Fig. 12. The average number, f(n, h), of inversions in an h-ordered file of n elements,
shown for n = 64.

then the average number of move operations in Algorithm D is

    \sum_{t > s ≥ 0} \bigl( r_s f(q_s + 1, h_{s+1}/h_s) + (h_s - r_s) f(q_s, h_{s+1}/h_s) \bigr),    (6)

where r_s = N mod h_s, q_s = ⌊N/h_s⌋, h_t = N h_{t-1}, and f is defined in (4).

Proof. The process of h_s-sorting consists of a straight insertion sort on r_s
(h_{s+1}/h_s)-ordered subfiles of length q_s + 1, and on (h_s - r_s) such subfiles of
length q_s. The divisibility condition implies that each of these subfiles is a ran-
dom (h_{s+1}/h_s)-ordered permutation, in the sense that each (h_{s+1}/h_s)-ordered
permutation is equally likely, since we are assuming that the original input was
a random permutation of distinct elements. ∎

Condition (5) in this corollary is always satisfied for two-pass shellsorts,
when the increments are h and 1. If q = ⌊N/h⌋ and r = N mod h, the quantity
B in Program D will have an average value of

    r f(q+1, N) + (h - r) f(q, N) + f(N, h) = \frac{r}{2}\binom{q+1}{2} + \frac{h-r}{2}\binom{q}{2} + f(N, h).

To a first approximation, the function f(n, h) equals (\sqrt{\pi}/8) n^{3/2} h^{1/2}; we can,
for example, compare it to the smooth curve in Fig. 12 when n = 64. Hence the
running time for a two-pass Program D is approximately proportional to

    2N^2/h + \sqrt{\pi}\, N^{3/2} h^{1/2}.

The best choice of h is therefore approximately \sqrt[3]{16N/\pi} ≈ 1.72 \sqrt[3]{N}; for
N = 1000, say, this suggests h ≈ 17, in line with the two-pass entry of Table 6
below. With this choice of h we get an average running time proportional to N^{5/3}.





Thus we can make a substantial improvement over straight insertion, from
O(N^2) to O(N^{1.667}), just by using shellsort with two increments. Clearly we
can do even better when more increments are used. Exercise 18 discusses the
optimum choice of h_{t-1}, ..., h_0 when t is fixed and when the h's are constrained
by the divisibility condition; the running time decreases to O(N^{1.5+ε/2}), where
ε = 1/(2^t - 1), for large N. We cannot break the N^{1.5} barrier by using the
formulas above, since the last pass always contributes

    f(N, h_1) ≈ (\sqrt{\pi}/8) N^{3/2} h_1^{1/2}

inversions to the sum.

But our intuition tells us that we can do even better when the increments
h_{t-1}, ..., h_0 do not satisfy the divisibility condition (5). For example, 8-sorting
followed by 4-sorting followed by 2-sorting does not allow any interaction between
keys in even and odd positions; therefore the final 1-sorting pass is inevitably
faced with O(N^{3/2}) inversions, on the average. By contrast, 7-sorting followed
by 5-sorting followed by 3-sorting jumbles things up in such a way that the final
1-sorting pass cannot encounter more than 2N inversions! (See exercise 26.)
Indeed, an astonishing phenomenon occurs: 

Theorem K. If a k-ordered file is h-sorted, it remains k-ordered. 

Thus a file that is first 7-sorted, then 5-sorted, becomes both 7-ordered and 
5-ordered. And if we 3-sort it, the result is ordered by 7s, 5s, and 3s. Examples 
of this remarkable property can be seen in Table 4 on page 85. 

Proof. Exercise 20 shows that Theorem K is a consequence of the following fact: 

Lemma L. Let m, n, r be nonnegative integers, and let (x_1, ..., x_{m+r}) and
(y_1, ..., y_{n+r}) be any sequences of numbers such that

    y_1 ≤ x_{m+1},  y_2 ≤ x_{m+2},  ...,  y_r ≤ x_{m+r}.    (7)

If the x's and y's are sorted independently, so that x_1 ≤ ··· ≤ x_{m+r} and
y_1 ≤ ··· ≤ y_{n+r}, the relations (7) will still be valid.

Proof. All but m of the x's are known to dominate (that is, to be greater than
or equal to) some y, where distinct x's dominate distinct y's. Let 1 ≤ j ≤ r.
Since x_{m+j} after sorting dominates m + j of the x's, it dominates at least j of
the y's; therefore it dominates the smallest j of the y's; hence x_{m+j} ≥ y_j after
sorting. ∎
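Theorem K is also easy to observe experimentally. The following C sketch, an
illustration of ours and not from the text, k-sorts and then h-sorts an array of
random keys and confirms that the k-ordering survives:

    #include <stdio.h>
    #include <stdlib.h>

    static void hsort(int a[], int n, int h)   /* one h-sorting pass */
    {
        for (int j = h; j < n; j++) {
            int k = a[j], i = j;
            while (i >= h && a[i - h] > k) { a[i] = a[i - h]; i -= h; }
            a[i] = k;
        }
    }

    static int ordered(const int a[], int n, int k)
    {
        for (int i = 0; i + k < n; i++)
            if (a[i] > a[i + k]) return 0;
        return 1;
    }

    int main(void)
    {
        enum { N = 1000 };
        int a[N];
        srand(314159);
        for (int i = 0; i < N; i++) a[i] = rand();
        hsort(a, N, 7);                            /* now 7-ordered            */
        hsort(a, N, 5);                            /* 5-ordered, per Theorem K */
        printf("7-ordered: %d, 5-ordered: %d\n",   /* prints 1, 1              */
               ordered(a, N, 7), ordered(a, N, 5));
        return 0;
    }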

Theorem K suggests that it is desirable to sort with relatively prime incre-
ments, but it does not lead directly to exact estimates of the number of moves
made in Algorithm D. Moreover, the number of permutations of {1, 2, ..., n}
that are both k-ordered and h-ordered is not always a divisor of n!, so we can see
that Theorem K does not tell the whole story; some k- and h-ordered files are
obtained more often than others after k- and h-sorting. Therefore the average-
case analysis of Algorithm D for general increments h_{t-1}, ..., h_0 has baffled
everyone so far when t > 3. There is not even an obvious way to find the worst




case, when N and (h_{t-1}, ..., h_0) are given. We can, however, derive several
facts about the approximate maximum running time when the increments have
certain forms:

Theorem P. The running time of Algorithm D is O(N^{3/2}), when h_s = 2^{s+1} - 1
for 0 ≤ s < t = ⌊lg N⌋.

Proof. It suffices to bound B_s, the number of moves in pass s, in such a way
that B_{t-1} + ··· + B_0 = O(N^{3/2}). During the first t/2 passes, for t > s ≥ t/2,
we may use the obvious bound B_s = O(h_s(N/h_s)^2); and for subsequent passes
we may use the result of exercise 23, B_s = O(N h_{s+2} h_{s+1}/h_s). Consequently
B_{t-1} + ··· + B_0 = O(N(2 + 2^2 + ··· + 2^{t/2} + 2^{t/2} + ··· + 2)) = O(N^{3/2}). ∎

This theorem is due to A. A. Papernov and G. V. Stasevich, Problemy
Peredachi Informatsii 1, 3 (1965), 81-98. It gives an upper bound on the worst-
case running time of the algorithm, not merely a bound on the average running
time. The result is not trivial, since the maximum running time when the h's
satisfy the divisibility constraint (5) is of order N^2; and exercise 24 shows that
the exponent 3/2 cannot be lowered.

An interesting improvement of Theorem P was discovered by Vaughan Pratt
in 1969: If the increments are chosen to be the set of all numbers of the form 2^p 3^q
that are less than N, the running time of Algorithm D is of order N(log N)^2. In
this case we can also make several important simplifications to the algorithm; see
exercises 30 and 31. However, even with these simplifications, Pratt's method
requires a substantial overhead because it makes quite a few passes over the data.
Therefore his increments don't actually sort faster than those of Theorem P in
practice, unless N is astronomically large. The best sequences for real-world N
appear to satisfy h_s ≈ ρ^s, where the ratio ρ ≈ h_{s+1}/h_s is roughly independent
of s but may depend on N.

We have observed that it is unwise to choose increments in such a way that
each is a divisor of all its predecessors; but we should not conclude that the best
increments are relatively prime to all of their predecessors. Indeed, every element
of a file that is gh-sorted and gk-sorted with h ⊥ k has at most ½(h - 1)(k - 1)
inversions when we are g-sorting. (See exercise 21.) Pratt's sequence {2^p 3^q}
wins as N → ∞ by exploiting this fact, but it grows too slowly for practical use.

Janet Incerpi and Robert Sedgewick [J. Comp. Syst. Sci. 31 (1985), 210-224;
see also Lecture Notes in Comp. Sci. 1136 (1996), 1-11] have found a way to have
the best of both worlds, by showing how to construct a sequence of increments
for which h_s ≈ ρ^s yet each increment is the gcd of two of its predecessors. Given
any number ρ > 1, they start by defining a base sequence a_1, a_2, ..., where a_k is
the least integer ≥ ρ^k such that a_j ⊥ a_k for 1 ≤ j < k. If ρ = 2.5, for example,
the base sequence is

    a_1, a_2, a_3, ... = 3, 7, 16, 41, 101, 247, 613, 1529, 3821, 9539, ....

Now they define the increments by setting h_0 = 1 and

    h_s = h_{s-r} a_r,   for \binom{r}{2} < s ≤ \binom{r+1}{2}.    (8)


92 


SORTING 


5.2.1 


Thus the sequence of increments starts

    1;  a_1;  a_2, a_1a_2;  a_1a_3, a_2a_3, a_1a_2a_3;  ....

For example, when ρ = 2.5 we get

    1, 3, 7, 21, 48, 112, 336, 861, 1968, 4592, 13776, 33936, 86961, 198768, ....
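Both the base sequence and the increments can be generated mechanically; this
short C sketch of ours (not from the text) reproduces the numbers above for
ρ = 2.5:

    #include <stdio.h>
    #include <math.h>

    static long gcd(long x, long y)
    {
        while (y) { long t = x % y; x = y; y = t; }
        return x;
    }

    int main(void)
    {
        const double rho = 2.5;
        enum { K = 7 };                 /* build a_1..a_K and the h_s          */
        long a[K + 1], h[K * (K + 1) / 2 + 1];

        for (int k = 1; k <= K; k++) {  /* a_k: least integer >= rho^k that is */
            long c = (long)ceil(pow(rho, k));  /* coprime to a_1, ..., a_{k-1} */
            for (int j = 1; j < k; )
                if (gcd(c, a[j]) > 1) { c++; j = 1; } else j++;
            a[k] = c;                   /* 3, 7, 16, 41, 101, 247, 613, ...    */
        }

        h[0] = 1;                       /* (8): h_s = h_{s-r} a_r              */
        for (int r = 1; r <= K; r++)    /*      for C(r,2) < s <= C(r+1,2)     */
            for (long s = (long)r*(r-1)/2 + 1; s <= (long)r*(r+1)/2; s++)
                h[s] = h[s - r] * a[r];

        for (int s = 0; s <= 13; s++)   /* 1, 3, 7, 21, 48, 112, 336, 861, ... */
            printf("%ld ", h[s]);
        printf("\n");
        return 0;
    }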
The crucial point is that we can turn recurrence (8) around:

    h_s = h_{s+r}/a_r,   for \binom{r}{2} < s + r ≤ \binom{r+1}{2}.    (9)

Therefore, by the argument in the previous paragraph, the number of inversions
per element when we are h_0-sorting, h_1-sorting, ... is at most

    b(a_2, a_1);  b(a_3, a_2), b(a_3, a_1);  b(a_4, a_3), b(a_4, a_2), b(a_4, a_1);  ...    (10)

where b(h, k) = ½(h - 1)(k - 1). If ρ^{t-1} ≤ N < ρ^t, the total number B of moves
is at most N times the sum of the first t elements of this sequence. Therefore
(see exercise 41) we can prove that the worst-case running time is much better
than order N^{1.5}:

Theorem I. The running time for Algorithm D is O(N e^{c\sqrt{\ln N}}) when the in-
crements h_s are defined by (8). Here c = \sqrt{8 \ln ρ} and the constant implied by O
depends on ρ. ∎

This asymptotic upper bound is not especially important as N → ∞,
because Pratt's sequence does better. The main point of Theorem I is that
a sequence of increments with the practical growth rate h_s ≈ ρ^s can have a
running time that is guaranteed to be O(N^{1+ε}) for arbitrarily small ε > 0, when
any value ρ > 1 is given.

Let's consider practical sizes of N more carefully by looking at the total
running time of Program D, namely (9B + 10NT + 13T - 10S - 3A + 1)u. Table 5
shows the average running time for various sequences of increments when N = 8.
For this small value of N, bookkeeping operations are the most significant part
of the cost, and the best results are obtained when t = 1; hence for N = 8
we are better off using simple straight insertion. (The average running time of
Program S when N = 8 is only 191.85u.) Curiously, the best two-pass algorithm
occurs when h_1 = 6, since a large value of S is more important here than a
small value of B. Similarly, the three increments 3 2 1 minimize the average
number of moves, but they do not lead to the best three-pass sequence. It may
be of interest to record here some "worst-case" permutations that maximize the
number of moves, since the general construction of such permutations is still
unknown:

    h_2 = 5, h_1 = 3, h_0 = 1:   8 5 2 6 3 7 4 1   (19 moves)
    h_2 = 3, h_1 = 2, h_0 = 1:   8 3 5 7 2 4 6 1   (17 moves)




Table 5
ANALYSIS OF ALGORITHM D WHEN N = 8

    Increments    A_ave    B_ave    S    T    MIX time
    1             1.718   14.000    1    1    204.85u
    2 1           2.667    9.657    3    2    235.91u
    3 1           2.917    9.100    4    2    220.15u
    4 1           3.083   10.000    5    2    217.75u
    5 1           2.601   10.000    6    2    209.20u
    6 1           2.135   10.667    7    2    206.60u
    7 1           1.718   12.000    8    2    209.85u
    4 2 1         3.500    8.324    7    3    274.42u
    5 3 1         3.301    8.167    9    3    253.60u
    3 2 1         3.320    7.829    6    3    280.50u


As N grows larger we have a slightly different picture. Table 6 shows 
the approximate number of moves for various sequences of increments when 
N = 1000. The first few entries satisfy the divisibility constraints (5), so 
that formula (6) and exercise 19 can be used; empirical tests were used to 
get approximate average values for the other cases. Ten thousand random files 
of 1000 elements were generated, and they each were sorted with each of the 
sequences of increments. The standard deviation of the number of left-to-right 
minima A was usually about 15; the standard deviation of the number of moves 
B was usually about 300. 

Some patterns are evident in this data, but the behavior of Algorithm D still
remains very obscure. Shell originally suggested using the increments ⌊N/2⌋,
⌊N/4⌋, ..., 1, but this is undesirable when the binary representation of N
contains a long string of zeros. Lazarus and Frank [CACM 3 (1960), 20-22]
suggested using essentially the same sequence, but adding 1 when necessary,
to make all increments odd. Hibbard [CACM 6 (1963), 206-213] suggested
using increments of the form 2^k - 1; Papernov and Stasevich suggested the form
2^k + 1. Other natural sequences investigated in Table 6 involve the numbers
(2^k - (-1)^k)/3 and (3^k - 1)/2, as well as Fibonacci numbers and the Incerpi-
Sedgewick sequences (8) for ρ = 2.5 and ρ = 2. Pratt-like sequences {5^p 11^q}
and {7^p 13^q} are also shown, because they retain the asymptotic O(N(log N)^2)
behavior but have lower overhead costs for small N. The final examples in
Table 6 come from another sequence devised by Sedgewick, based on slightly
different heuristics [J. Algorithms 7 (1986), 159-173]:

    h_s = 9·2^s - 9·2^{s/2} + 1,       if s is even;
          8·2^s - 6·2^{(s+1)/2} + 1,   if s is odd.    (11)

When these increments (h_0, h_1, h_2, ...) = (1, 5, 19, 41, 109, 209, ...) are used,
Sedgewick proved that the worst-case running time is O(N^{4/3}).

The minimum number of moves, about 6750, was observed for increments
of the form 2^k + 1, and also in the Incerpi-Sedgewick sequence for ρ = 2. But it
is important to realize that the number of moves is not the only consideration,




Table 6
APPROXIMATE BEHAVIOR OF ALGORITHM D WHEN N = 1000

    Increments                                  A_ave   B_ave    T
    1                                               6  249750    1
    17 1                                           65   41667    2
    60 6 1                                        158   26361    3
    140 20 4 1                                    262   21913    4
    256 64 16 4 1                                 362   20459    5
    576 192 48 16 4 1                             419   20088    6
    729 243 81 27 9 3 1                           378   18533    7
    512 256 128 64 32 16 8 4 2 1                  493   16435   10
    500 250 125 62 31 15 7 3 1                    516    7655    9
    501 251 125 63 31 15 7 3 1                    558    7370    9
    511 255 127 63 31 15 7 3 1                    559    7200    9
    255 127 63 31 15 7 3 1                        436    7445    8
    127 63 31 15 7 3 1                            299    8170    7
    63 31 15 7 3 1                                190    9860    6
    31 15 7 3 1                                   114   13615    5
    513 257 129 65 33 17 9 5 3 1                  561    6745   10
    257 129 65 33 17 9 5 3 1                      440    6995    9
    129 65 33 17 9 5 3 1                          304    7700    8
    65 33 17 9 5 3 1                              197    9300    7
    33 17 9 5 3 1                                 122   12695    6
    683 341 171 85 43 21 11 5 3 1                 511    7365   10
    341 171 85 43 21 11 5 3 1                     490    7490    9
    255 63 15 7 3 1                               373    8620    6
    257 65 17 5 3 1                               375    8990    6
    341 85 21 5 3 1                               410    9345    6
    377 233 144 89 55 34 21 13 8 5 3 2 1          518    7400   13
    233 144 89 55 34 21 13 8 5 3 2 1              432    7610   12
    377 144 55 21 8 3 1                           456    8795    7
    365 122 41 14 5 2 1                           440    8085    7
    364 121 40 13 4 1                             437    8900    6
    121 40 13 4 1                                 268    9790    5
    336 112 48 21 7 3 1                           432    7840    7
    306 170 90 45 18 10 5 2 1                     465    6755    9
    169 91 49 13 7 1                              349    8698    6
    275 125 121 55 25 11 5 1                      446    6788    8
    190 84 37 16 7 3 1                            359    7201    7
    929 505 209 109 41 19 5 1                     512    7725    8
    505 209 109 41 19 5 1                         519    7790    7
    209 109 41 19 5 1                             382    8165    6


even though it dominates the asymptotic running time. Since Program D takes
9B + 10(NT - S) + ··· units of time, we see that saving one pass is about as
desirable as saving (10/9)N moves; when N = 1000 we are willing to add 1111 moves
if we can save one pass. (The first pass is very quick, however, if h_{t-1} is near N,
because NT - S = (N - h_{t-1}) + ··· + (N - h_0).)




Empirical tests conducted by M. A. Weiss [Comp. J. 34 (1991), 88-91]
suggest strongly that the average number of moves performed by Algorithm D
with increments 2^k - 1, ..., 15, 7, 3, 1 is approximately proportional to N^{5/4}.
More precisely, Weiss found that B_ave ≈ 1.55N^{5/4} - 4.48N + O(N^{3/4}) for
100 ≤ N ≤ 12000000 when these increments are used; the empirical standard
deviation was approximately 0.065N^{5/4}. On the other hand, subsequent tests by
Marcin Ciura show that Sedgewick's sequence (11) apparently makes B_ave =
O(N(log N)^2) or better. The standard deviation for sequence (11) is amazingly
small for N < 10^6, but it mysteriously begins to "explode" when N passes 10^7.

Table 7 shows typical breakdowns of moves per pass obtained in three
random experiments, using increments of the forms 2^k - 1, 2^k + 1, and (11).
The same file of numbers was used in each case. The total number of moves,
Σ B_s, comes to 346152, 329532, 248788 in the three cases, so sequence (11) is
clearly superior in this example.


Table 7
MOVES PER PASS: EXPERIMENTS WITH N = 20000

    h_s     B_s      h_s     B_s      h_s     B_s
    4095    19458    4097    19459    3905    20714
    2047    15201    2049    14852    2161    13428
    1023    16363    1025    15966     929    18206
     511    18867     513    18434     505    16444
     255    23232     257    22746     209    21405
     127    28034     129    27595     109    19605
      63    33606      65    34528      41    26604
      31    40350      33    45497      19    23441
      15    66037      17    48717       5    38941
       7    43915       9    38560       1    50000
       3    24191       5    20271
       1    16898       3     9448
                        1    13459




Although Algorithm D is gradually becoming better understood, more than
three decades of research have failed to turn up any grounds for making strong
assertions about what sequences of increments make it work best. If N is less
than 1000, a simple rule such as

    Let h_0 = 1, h_{s+1} = 3h_s + 1, and stop with h_{t-1} when h_{t+1} > N    (12)

seems to be about as good as any other. For larger values of N, Sedgewick's
sequence (11) can be recommended. Still better results, possibly even of order
N log N, have been reported by N. Tokuda using the quantity ⌊2.25 h_s⌋ in place
of 3h_s in (12); see Information Processing 92 1 (1992), 449-457.
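Rule (12) is a three-line loop in practice; here is a small C sketch of ours showing
the increments it selects:

    #include <stdio.h>

    /* Increments by rule (12): h_0 = 1, h_{s+1} = 3 h_s + 1;
       stop with h_{t-1} once h_{t+1} exceeds N. */
    int main(void)
    {
        long N = 1000, h[32];
        int s = 0;
        h[0] = 1;
        while (h[s] <= N) { h[s + 1] = 3 * h[s] + 1; s++; }
        int t = s - 1;                       /* h[s] is the first h_{t+1} > N */
        for (int i = t - 1; i >= 0; i--)     /* use h_{t-1}, ..., h_0         */
            printf("%ld ", h[i]);            /* N = 1000: 121 40 13 4 1       */
        printf("\n");
        return 0;
    }

For N = 1000 this yields 121, 40, 13, 4, 1, one of the rows of Table 6.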

List insertion. Let us now leave shellsort and consider other types of im- 
provements over straight insertion. One of the most important general ways to 
improve on a given algorithm is to examine its data structures carefully, since 




a reorganization of data structures to avoid unnecessary operations often leads 
to substantial savings. Further discussion of this general idea appears in Section 
2.4, where a rather complex algorithm is studied; let us consider how it applies 
to a very simple algorithm like straight insertion. What is the most appropriate 
data structure for Algorithm S? 

Straight insertion involves two basic operations: 

i) scanning an ordered file to find the largest key less than or equal to a given 
key; and 

ii) inserting a new record into a specified part of the ordered file. 

The file is obviously a linear list, and Algorithm S handles this list by using 
sequential allocation (Section 2.2.2); therefore it is necessary to move roughly 
half of the records in order to accomplish each insertion operation. On the 
other hand, we know that linked allocation (Section 2.2.3) is ideally suited to 
insertion, since only a few links need to be changed; and the other operation, 
sequential scanning, is about as easy with linked allocation as with sequential 
allocation. Only one-way linkage is needed, since we always scan the list in the 
same direction. Therefore we conclude that the right data structure for straight 
insertion is a one-way, linked linear list. It also becomes convenient to revise 
Algorithm S so that the list is scanned in increasing order: 

Algorithm L (List insertion). Records R_1, ..., R_N are assumed to contain keys
K_1, ..., K_N, together with link fields L_1, ..., L_N capable of holding the numbers
0 through N; there is also an additional link field L_0, in an artificial record
R_0 at the beginning of the file. This algorithm sets the link fields so that the
records are linked together in ascending order. Thus, if p(1) ... p(N) is the stable
permutation that makes K_{p(1)} ≤ ··· ≤ K_{p(N)}, this algorithm will yield

    L_0 = p(1);   L_{p(i)} = p(i+1), for 1 ≤ i < N;   L_{p(N)} = 0.    (13)

L1. [Loop on j.] Set L_0 ← N, L_N ← 0. (Link L_0 acts as the "head" of the list,
    and 0 acts as a null link; hence the list is essentially circular.) Perform steps
    L2 through L5 for j = N-1, N-2, ..., 1; then terminate the algorithm.

L2. [Set up p, q, K.] Set p ← L_0, q ← 0, K ← K_j. (In the following steps we
    will insert R_j into its proper place in the linked list, by comparing K with
    the previous keys in ascending order. The variables p and q act as pointers
    to the current place in the list, with p = L_q so that q is one step behind p.)

L3. [Compare K : K_p.] If K ≤ K_p, go to step L5. (We have found the desired
    position for record R, between R_q and R_p in the list.)

L4. [Bump p, q.] Set q ← p, p ← L_q. If p > 0, go back to step L3. (If p = 0,
    K is the largest key found so far; hence record R belongs at the end of the
    list, between R_q and R_0.)

L5. [Insert into list.] Set L_q ← j, L_j ← p. ∎
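In C, with the book's one-based keys K[1..n] and links L[0..n], Algorithm L
becomes the following sketch (ours, not a transcription of Program L):

    #include <stdio.h>

    /* Algorithm L: link records together in ascending order of key.
       L[0] heads the list; 0 is the null link. */
    void list_insertion(const int K[], int L[], int n)
    {
        L[0] = n; L[n] = 0;                 /* L1: R_n alone in the list   */
        for (int j = n - 1; j >= 1; j--) {  /* L1: j = n-1, ..., 1         */
            int q = 0, p = L[0];            /* L2: set up p, q             */
            while (p > 0 && K[p] < K[j]) {  /* L3/L4: scan until K <= K_p  */
                q = p; p = L[q];
            }
            L[q] = j; L[j] = p;             /* L5: insert R_j after R_q    */
        }
    }

    int main(void)
    {
        int K[17] = { 0, 503, 87, 512, 61, 908, 170, 897, 275,
                      653, 426, 154, 509, 612, 677, 765, 703 };
        int L[17];
        list_insertion(K, L, 16);
        for (int p = L[0]; p != 0; p = L[p])
            printf("%03d ", K[p]);          /* 061 087 154 ... 897 908     */
        printf("\n");
        return 0;
    }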

This algorithm is important not only because it is a simple sorting method, 
but also because it occurs frequently as part of other list-processing algorithms. 




Table 8 shows the first few steps that occur when our sixteen example numbers 
are sorted; exercise 32 gives the final link setting. 


Table 8
EXAMPLE OF LIST INSERTION

    j:     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
    K_j:   —  503  087  512  061  908  170  897  275  653  426  154  509  612  677  765  703
    L_j:  16    —    —    —    —    —    —    —    —    —    —    —    —    —    —    —    0
    L_j:  16    —    —    —    —    —    —    —    —    —    —    —    —    —    —    0   15
    L_j:  14    —    —    —    —    —    —    —    —    —    —    —    —    —   16    0   15


Program L (List insertion). We assume that K_j is stored in INPUT+j (0:3),
and L_j is stored in INPUT+j (4:5). rI1 ≡ j; rI2 ≡ p; rI3 ≡ q; rA(0:3) ≡ K.

01  KEY    EQU   0:3
02  LINK   EQU   4:5
03  START  ENT1  N              1        L1. Loop on j. j ← N.
04         ST1   INPUT(LINK)    1        L_0 ← N.
05         STZ   INPUT+N(LINK)  1        L_N ← 0.
06         JMP   6F             1        Go to decrease j.
07  2H     LD2   INPUT(LINK)    N-1      L2. Set up p, q, K. p ← L_0.
08         ENT3  0              N-1      q ← 0.
09         LDA   INPUT,1        N-1      K ← K_j.
10  3H     CMPA  INPUT,2(KEY)   B+N-1-A  L3. Compare K : K_p.
11         JLE   5F             B+N-1-A  To L5 if K ≤ K_p.
12  4H     ENT3  0,2            B        L4. Bump p, q. q ← p.
13         LD2   INPUT,3(LINK)  B        p ← L_q.
14         J2P   3B             B        To L3 if p > 0.
15  5H     ST1   INPUT,3(LINK)  N-1      L5. Insert into list. L_q ← j.
16         ST2   INPUT,1(LINK)  N-1      L_j ← p.
17  6H     DEC1  1              N
18         J1P   2B             N        N > j ≥ 1. ∎


The running time of this program is (7B + 14N - 3A - 6)u, where N is
the length of the file, A + 1 is the number of right-to-left maxima, and B is the
number of inversions in the original permutation. (See the analysis of Program S.
Note that Program L does not rearrange the records in memory; this can be done
as in exercise 5.2-12, at a cost of about 20N additional units of time.) Program S
requires (9B + 10N - 3A - 9)u, and since B is about ¼N^2, we can see that the
extra memory space used for the link fields has saved about 22 percent of the
execution time. Another 22 percent can be saved by careful programming (see
exercise 33), but the running time remains proportional to N^2.

To summarize what we have done so far: We started with Algorithm S,
a simple and natural sorting algorithm that does about ¼N^2 comparisons and
¼N^2 moves. We improved it in one direction by considering binary insertion,
which does about N lg N comparisons and ¼N^2 moves. Changing the data





Fig. 13. Example of Wheeler’s tree insertion scheme. 


structure slightly with "two-way insertion" cuts the number of moves down
to about ⅛N^2. Shellsort cuts the number of comparisons and moves to about
N^{7/6}, for N in a practical range; as N → ∞ this number can be lowered to
order N(log N)^2. Another way to improve on Algorithm S, using a linked data
structure, gave us the list insertion method, which does about ¼N^2 comparisons,
0 moves, and 2N changes of links.

Is it possible to marry the best features of these methods, reducing the
number of comparisons to order N log N as in binary insertion, yet reducing
the number of moves as in list insertion? The answer is yes, by going to a
tree-structured arrangement. This possibility was first explored about 1957 by
D. J. Wheeler, who suggested using two-way insertion until it becomes necessary
to move some data; then instead of moving the data, a pointer to another area
of memory is inserted, and the same technique is applied recursively to all items
that are to be inserted into this new area of memory. Wheeler's original method
[see A. S. Douglas, Comp. J. 2 (1959), 5] was a complicated combination of
sequential and linked memory, with nodes of varying size; for our 16 example
numbers the tree of Fig. 13 would be formed. A similar but simpler tree-insertion
scheme, using binary trees, was devised by C. M. Berners-Lee about 1958 [see
Comp. J. 3 (1960), 174, 184]. Since the binary tree method and its refinements
are quite important for searching as well as sorting, they are discussed at length
in Section 6.2.2.


Still another way to improve on straight insertion is to consider inserting
several things at a time. For example, if we have a file of 1000 items, and
if 998 of them have already been sorted, Algorithm S makes two more passes
through the file (first inserting R_999, then R_1000). We can obviously save time
if we compare K_999 with K_1000, to see which is larger, then insert them both
with one look at the file. A combined operation of this kind involves about ⅔N
comparisons and moves (see exercise 3.4.2-5), instead of two passes each with
about ½N comparisons and moves.


In other words, it is generally a good idea to “batch” operations that require 
long searches, so that multiple operations can be done together. If we carry this 
idea to its natural conclusion, we rediscover the method of sorting by merging, 
which is so important it is discussed in Section 5 . 2 . 4 . 




Address calculation sorting. Surely by now we have exhausted all possible 
ways to improve on the simple method of straight insertion; but let’s look again! 
Suppose you want to arrange several dozen books on your bookshelves, in order 
by authors’ names, when the books are given to you in random order. You’ll 
naturally try to estimate the final position of each book as you put it in place, 
thereby reducing the number of comparisons and moves that you’ll have to make. 
And the whole process will be somewhat more efficient if you start with a little 
more shelf space than is absolutely necessary. This method was first suggested 
for computer sorting by Isaac and Singleton, JACM 3 (1956), 169-174, and it 
was developed further by Tarter and Kronmal, Proc. ACM National Conference 
21 (1966), 331-337. 

Address calculation sorting usually requires additional storage space propor- 
tional to N, either to leave enough room so that excessive moving is not required, 
or to maintain auxiliary tables that account for irregularities in the distribution 
of keys. (See the "distribution counting" sort, Algorithm 5.2D, which is a form
of address calculation.) We can probably make the best use of this additional
memory space if we devote it to link fields, as in the list insertion method. In this 
way we can also avoid having separate areas for input and output; everything 
can be done in the same area of memory. 

These considerations suggest that we generalize list insertion so that several 
lists are kept, not just one. Each list is used for certain ranges of keys. We 
make the important assumption that the keys are pretty evenly distributed, not 
“bunched up” irregularly: The set of all possible values of the keys is partitioned 
into M parts, and we assume a probability of 1/M that a given key falls into a
given part. Then we provide additional storage for M list heads, and each list 
is maintained as in simple list insertion. 

It is not necessary to give the algorithm in great detail here; the method 
simply begins with all list heads set to Λ. As each new item enters, we first decide
which of the M parts its key falls into, then we insert it into the corresponding 
list as in Algorithm L. 

To illustrate this approach, suppose that the 16 keys used in our examples
are divided into the M = 4 ranges 0-249, 250-499, 500-749, 750-999. We
obtain the following configurations as the keys K_1, K_2, ..., K_16 are successively
inserted:

             After 4 items:   After 8 items:   After 12 items:       Final state:
    List 1:  061, 087         061, 087, 170    061, 087, 154, 170    061, 087, 154, 170
    List 2:                   275              275, 426              275, 426
    List 3:  503, 512         503, 512         503, 509, 512, 653    503, 509, 512, 612, 653, 677, 703
    List 4:                   897, 908         897, 908              765, 897, 908

(Program M below actually inserts the keys in reverse order, K_16, ..., K_2, K_1,
but the final result is the same.) Because linked memory is used, the varying-
length lists cause no storage allocation problem. All lists can be combined into
a single list at the end, if desired (see exercise 35).
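The idea is easy to express in C; the following sketch of ours (it parallels
Program M's logic but is not a transcription of it) keeps M = 4 list heads and
reproduces the final state shown above:

    #include <stdio.h>

    #define M 4                                  /* number of lists        */
    #define MAXKEY 1000                          /* keys lie in [0,MAXKEY) */

    void multilist_insertion(const int K[], int L[], int head[], int n)
    {
        for (int m = 0; m < M; m++) head[m] = 0; /* empty lists; 0 = null  */
        for (int j = n; j >= 1; j--) {           /* insert K_n, ..., K_1   */
            int m = (int)((long)K[j] * M / MAXKEY);  /* which range?       */
            int q = -1, p = head[m];             /* q = -1 means "at head" */
            while (p > 0 && K[p] < K[j]) { q = p; p = L[p]; }
            if (q < 0) head[m] = j; else L[q] = j;
            L[j] = p;
        }
    }

    int main(void)
    {
        int K[17] = { 0, 503, 87, 512, 61, 908, 170, 897, 275,
                      653, 426, 154, 509, 612, 677, 765, 703 };
        int L[17], head[M];
        multilist_insertion(K, L, head, 16);
        for (int m = 0; m < M; m++) {
            printf("List %d:", m + 1);
            for (int p = head[m]; p != 0; p = L[p]) printf(" %03d", K[p]);
            printf("\n");
        }
        return 0;
    }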




Program M (Multiple list insertion). In this program we make the same
assumptions as in Program L, except that the keys must be nonnegative, thus

    0 ≤ K_j < (BYTESIZE)^3.

The program divides this range into M equal parts by multiplying each key by a
suitable constant. The list heads are in locations HEAD+1 through HEAD+M.


01  KEY    EQU   1:3
02  LINK   EQU   4:5
03  START  ENT2  M               1      
04         STZ   HEAD,2          M      HEAD[p] ← Λ.
05         DEC2  1               M      
06         J2P   *-2             M      M ≥ p ≥ 1.
07         ENT1  N               1      j ← N.
08  2H     LDA   INPUT,1(KEY)    N      
09         MUL   =M(1:3)=        N      rA ← ⌊M · K_j/BYTESIZE³⌋.
10         STA   *+1(1:2)        N      
11         ENT4  0               N      rI4 ← rA.
12         ENT3  HEAD+1-INPUT,4  N      q ← LOC(HEAD[rA]).
13         LDA   INPUT,1         N      K ← K_j.
14         JMP   4F              N      Jump to set p.
15  3H     CMPA  INPUT,2(KEY)    B+N-A  
16         JLE   5F              B+N-A  Jump to insert, if K ≤ K_p.
17         ENT3  0,2             B      q ← p.
18  4H     LD2   INPUT,3(LINK)   B+N    p ← LINK(q).
19         J2P   3B              B+N    Jump if not end of list.
20  5H     ST1   INPUT,3(LINK)   N      LINK(q) ← LOC(R_j).
21         ST2   INPUT,1(LINK)   N      LINK(LOC(R_j)) ← p.
22  6H     DEC1  1               N      
23         J1P   2B              N      N ≥ j ≥ 1. ∎


This program is written for general M, but it would be better to fix M
at some convenient value; for example, we might choose M = BYTESIZE, so
that the list heads could be cleared with a single MOVE instruction and the
multiplication sequence of lines 08-11 could be replaced by the single instruc-
tion LD4 INPUT,1(1:1). The most notable contrast between Program L and
Program M is the fact that Program M must consider the case of an empty list,
when no comparisons are to be made.


How much time do we save by having M lists? The total running time of
Program M is 7B + 31N - 3A + 4M + 2 units, where M is the number of lists
and N is the number of records sorted; A and B respectively count the right-to-
left maxima and the inversions present among the keys belonging to each list.
(In contrast to other time analyses of this section, the rightmost element of a
nonempty permutation is included in the count A.) We have already studied
A and B for M = 1, when their average values are respectively H_N and \tfrac{1}{2}\binom{N}{2}.
By our assumption about the distribution of keys, the probability that a given
list contains precisely n items at the conclusion of sorting is the "binomial"
probability

    \binom{N}{n} \Bigl(\frac{1}{M}\Bigr)^n \Bigl(1 - \frac{1}{M}\Bigr)^{N-n}.    (14)



Therefore the average values of A and B in the general case are

    A_ave = M \sum_n \binom{N}{n} \Bigl(\frac{1}{M}\Bigr)^n \Bigl(1 - \frac{1}{M}\Bigr)^{N-n} H_n;    (15)

    B_ave = M \sum_n \binom{N}{n} \Bigl(\frac{1}{M}\Bigr)^n \Bigl(1 - \frac{1}{M}\Bigr)^{N-n} \tfrac{1}{2}\binom{n}{2}.    (16)

Using the identity

    \binom{N}{n}\binom{n}{2} = \binom{N}{2}\binom{N-2}{n-2},

which is a special case of Eq. 1.2.6-(20), we can easily evaluate the sum in (16):

    B_ave = \frac{1}{2M}\binom{N}{2}.    (17)

And exercise 37 derives the standard deviation of B. But the sum in (15) is
more difficult. By Theorem 1.2.7A, we have

    \sum_n \binom{N}{n} (M-1)^{-n} H_n = \Bigl(1 - \frac{1}{M}\Bigr)^{-N} (H_N - \ln M) + \epsilon,
        0 ≤ \epsilon < \Bigl(1 - \frac{1}{M}\Bigr)^{-N} \frac{M-1}{N+1};

hence

    A_ave = M(H_N - \ln M) + \delta,   0 ≤ \delta < \frac{M^2}{N+1}.    (18)

(This formula is practically useless when M ≈ N; exercise 40 gives a more
detailed analysis of the asymptotic behavior of A_ave when M = N/α.)

By combining (17) and (18) we can deduce the total running time of Pro-
gram M, for fixed M as N → ∞:

    min  31N + M + 2,
    ave  1.75N^2/M + 31N - 3MH_N + 3M ln M + 4M - 3δ - 1.75N/M + 2,
    max  3.50N^2 + 24.5N + 4M + 2.    (19)

Notice that when M is not too large we are speeding up the average time by
a factor of M; M = 10 will sort about ten times as fast as M = 1. However,
the maximum time is much larger than the average time; this underscores the
assumption we have made about a fairly equal distribution of keys, since the
worst case occurs when all records pile onto the same list.

If we set M = N, the average running time of Program M is approximately
34.36N units; when M = ½N it is slightly more, approximately 34.52N; and
when M = N/10 it is approximately 48.04N. The additional cost of the sup-
plementary program in exercise 35, which links all M lists together in a single
list, raises these times respectively to 44.99N, 41.95N, and 52.74N. (Note that




10N of these MIX time units are spent in the multiplication instruction alone!)
We have achieved a sorting method of order N, provided only that the keys are
reasonably well spread out over their range.

Improvements to multiple list insertion are discussed in Section 5.2.5.

EXERCISES 

1. [ 10] Is Algorithm S a stable sorting algorithm? 

2. [11] Would Algorithm S still sort numbers correctly if the relation "K ≥ K_i" in
step S3 were replaced by "K > K_i"?

► 3. [30] Is Program S the shortest possible sorting program that can be written for 
MIX, or is there a shorter program that achieves the same effect? 

► 4. [M20] Find the minimum and maximum running times for Program S, as a 
function of N. 

► 5. [M27] Find the generating function g_N(z) = Σ_{k≥0} p_{Nk} z^k for the total running
time of Program S, where p_{Nk} is the probability that Program S takes exactly k units
of time, given a random permutation of {1, 2, ..., N} as input. Also calculate the
standard deviation of the running time, given N.

6. [23] The two-way insertion method illustrated in Table 2 seems to imply that 
there is an output area capable of holding up to 2 N + 1 records, in addition to the 
input area containing N records. Show that two-way insertion can be done using only 
enough space for N + 1 records, including both input and output. 

7. [M20] If a_1 a_2 ... a_n is a random permutation of {1, 2, ..., n}, what is the average
value of |a_1 - 1| + |a_2 - 2| + ··· + |a_n - n|? (This is n times the average net distance
traveled by a record during a sorting process.)

8. [10] Is Algorithm D a stable sorting algorithm? 

9. [20] What are the quantities A and B, and the total running time of Program D, 
corresponding to Tables 3 and 4? Discuss the relative merits of shellsort versus straight 
insertion in this case. 

► 10. [22] If K_j ≥ K_{j-h} when we begin step D3, Algorithm D specifies a lot of actions
that accomplish nothing. Show how to modify Program D so that this redundant
computation can be avoided, and discuss the merits of such a modification.

11. [M10] What path in a lattice like that of Fig. 11 corresponds to the permutation 
1 2 5 3 7 4 8 6 9 11 10 12? 

12. [M20] Prove that the area between a lattice path and the staircase path (as shown 
in Fig. 11) equals the number of inversions in the corresponding 2-ordered permutation. 

► 13. [M16] Explain how to put weights on the horizontal line segments of a lattice, 
instead of the vertical segments, so that the sum of the horizontal weights on a lattice 
path is the number of inversions in the corresponding 2-ordered permutation. 

14. [M28] (a) Show that, in the sums defined by Eq. (2), we have A_{2n+1} = 2A_{2n}.
(b) The general identity of exercise 1.2.6-26 simplifies to

    \sum_k \binom{2k+s}{k} z^k = \frac{1}{\sqrt{1-4z}} \left(\frac{1 - \sqrt{1-4z}}{2z}\right)^s

if we set r = s, t = -2. By considering the sum Σ_n A_{2n} z^n, show that

    A_{2n} = n · 4^{n-1}.




► 15. [HM33] Let g_n(z), ḡ_n(z), h_n(z), and h̄_n(z) be Σ z^{total weight of path}, summed over
all lattice paths of length 2n from (0, 0) to (n, n), where the weight is defined as in
Fig. 11, subject to certain restrictions on the vertices on the paths: For h_n(z), there is
no restriction, but for g_n(z) the path must avoid all vertices (i, j) with i > j; h̄_n(z) and
ḡ_n(z) are defined similarly, except that all vertices (i, i) are also excluded, for 0 < i < n.
Thus

    g_0(z) = 1,  g_1(z) = z,  g_2(z) = z^3 + z^2;   ḡ_1(z) = z,  ḡ_2(z) = z^3;
    h_0(z) = 1,  h_1(z) = z + 1,  h_2(z) = z^3 + z^2 + 3z + 1;
    h̄_1(z) = z + 1,  h̄_2(z) = z^3 + z.

Find recurrence relations defining these functions, and use these relations to prove that

    \frac{h_n''(1) + h_n'(1)}{h_n(1)} = \frac{7n^3 + 4n^2 + 4n}{30}.

(The exact formula for the variance of the number of inversions in a random 2-ordered
permutation of {1, 2, ..., 2n} is therefore easily found; it is asymptotically (\tfrac{7}{30} - \tfrac{\pi}{16})n^3.)

16 . [M24] Find a formula for the maximum number of inversions in an h-ordered 
permutation of {1,2, ...,n}. What is the maximum possible number of moves in 
Algorithm D when the increments satisfy the divisibility condition ( 5 )? 

17. [M21] Show that, when N = 2^t and h_s = 2^s for t > s ≥ 0, there is a unique
permutation of {1, 2, ..., N} that maximizes the number of move operations performed
by Algorithm D. Find a simple way to describe this permutation.

18. [HM24] For large N the sum (6) can be estimated as

    \frac{1}{4}\frac{N^2}{h_{t-1}} + \frac{\sqrt{\pi}}{8}\left(\frac{N^{3/2} h_{t-1}^{1/2}}{h_{t-2}} + ··· + \frac{N^{3/2} h_1^{1/2}}{h_0}\right).

What real values of h_{t-1}, ..., h_0 minimize this expression when N and t are fixed and
h_0 = 1?

► 19. [M25] What is the average value of the quantity A in the timing analysis of
Program D, when the increments satisfy the divisibility condition (5)?

20. [M22] Show that Theorem K follows from Lemma L.

21. [M25] Let h and k be relatively prime positive integers, and say that an integer
is generable if it equals xh + yk for some nonnegative integers x and y. Show that n
is generable if and only if hk - h - k - n is not generable. (Since 0 is the smallest
generable integer, the largest nongenerable integer must therefore be hk - h - k. It
follows that K_i ≤ K_j whenever j - i ≥ (h - 1)(k - 1), in any file that is both h-ordered
and k-ordered.)

22. [M30] Prove that all integers ≥ 2^s(2^s - 1) can be represented in the form

    a_0(2^s - 1) + a_1(2^{s+1} - 1) + a_2(2^{s+2} - 1) + ···,

where the a_j's are nonnegative integers; but 2^s(2^s - 1) - 1 cannot be so represented.
Furthermore, exactly 2^{s-1}(2^s + s - 3) positive integers are unrepresentable in this form.
Find analogous formulas when the quantities 2^k - 1 are replaced by 2^k + 1 in the
representations.




► 23. [M22] Prove that if h_{s+2} and h_{s+1} are relatively prime, the number of moves that
occur while Algorithm D is using the increment h_s is O(N h_{s+2} h_{s+1}/h_s). Hint: See
exercise 21.

24. [M42] Prove that Theorem P is best possible, in the sense that the exponent 3/2
cannot be lowered.

► 25. [M22] How many permutations of {1, 2, ..., N} are both 3-ordered and 2-ordered?
What is the maximum number of inversions in such a permutation? What is the total
number of inversions among all such permutations?

26. [M35] Can a file of N elements have more than N inversions if it is 3-, 5-, and
7-ordered? Estimate the maximum number of inversions when N is large.

27. [M41] (Bjorn Poonen.) (a) Prove that there is a constant c such that if m of the
increments h_s in Algorithm D are less than N/2, the running time is Ω(N^{1+c/\sqrt{m}}) in the
worst case. (b) Consequently the worst-case running time is Ω(N(log N/log log N)^2)
for all sequences of increments.

28. [15] Which sequence of increments shown in Table 6 is best from the standpoint 
of Program D, considering the average total running time? 

29. [40] For N = 1000 and various values of t, find empirical values of h_{t-1}, ...,
h_1, h_0 for which the average number of moves, B_ave, is as small as you can make it.

30. [M23] (V. Pratt.) If the set of increments in shellsort is {2^p 3^q | 2^p 3^q < N},
show that the number of passes is approximately ½(log_2 N)(log_3 N), and the number
of moves per pass is at most N/2. In fact, if K_{j-h} > K_j on any pass, we will always
have K_{j-3h}, K_{j-2h} ≤ K_j < K_{j-h} ≤ K_{j+h}, K_{j+2h}; so we may simply interchange K_{j-h}
and K_j and increase j by 2h, saving two of the comparisons of Algorithm D. Hint: See
exercise 25.

► 31. [25] Write a MIX program for Pratt's sorting algorithm (exercise 30). Express its
running time in terms of quantities A, B, S, T, N analogous to those in Program D.

32. [10] What would be the final contents of L_0 L_1 ... L_16 if the list insertion sort in
Table 8 were carried through to completion?

► 33. [25] Find a way to improve on Program L so that its running time is dominated 
by 5 B instead of 7 B, where B is the number of inversions. Discuss corresponding 
improvements to Program S. 

34. [M10] Verify formula (14). 

35. [21 ] Write a MIX program to follow Program M, so that all lists are combined into 
a single list. Your program should set the LINK fields exactly as they would have been 
set by Program L. 

36. [18] Assume that the byte size of MIX is 100, and that the sixteen example keys
in Table 8 are actually 503000, 087000, 512000, ..., 703000. Determine the running
time of Programs L and M on this data, when M = 4.

37. [M25] Let g_n(z) be the probability generating function for inversions in a random
permutation of n objects, Eq. 5.1.1-(11). Let g_{NM}(z) be the corresponding generating
function for the quantity B in Program M. Show that

    \sum_{N≥0} g_{NM}(z) \frac{M^N w^N}{N!} = \left(\sum_{n≥0} g_n(z) \frac{w^n}{n!}\right)^M,

and use this formula to derive the variance of B.




38. [HM23] (R. M. Karp.) Let F(x) be a distribution function for a probability
distribution, with F(0) = 0 and F(1) = 1. Given that the keys K_1, K_2, ..., K_N are
independently chosen at random from this distribution, and that M = cN, where c
is constant and N → ∞, prove that the average running time of Program M is O(N)
when F is sufficiently smooth. (A key K is inserted into list j when ⌊MK⌋ = j - 1; this
occurs with probability F(j/M) - F((j-1)/M). Only the case F(x) = x, 0 ≤ x ≤ 1,
is treated in the text.)

39. [HM16] If a program runs in approximately A/M + B units of time and uses
C + M locations in memory, what choice of M gives the minimum time × space?

► 40. [HM24] Find the asymptotic value of the average number of right-to-left maxima
that occur in multiple list insertion, Eq. (15), when M = N/α for fixed α as N → ∞.
Carry out the expansion to an absolute error of O(N^{-1}), expressing your answer in
terms of the exponential integral function E_1(z) = \int_z^\infty e^{-t}\, dt/t.

41. [HM26] (a) Prove that the sum of the first \binom{k}{2} elements of (10) is O(ρ^{2k}). (b) Now
prove Theorem I.

42. [HM43] Analyze the average behavior of shellsort when there are t = 3 increments
h, g, and 1, assuming that h ⊥ g. The first pass, h-sorting, obviously does a total of
¼N^2/h + O(N) moves.

a) Prove that the second pass, g-sorting, does \frac{\sqrt{\pi}}{8}(\sqrt{h} - 1/\sqrt{h}) N^{3/2}/g + O(hN)
moves.

b) Prove that the third pass, 1-sorting, does ψ(h, g)N + O(g^3 h^2) moves, where

    ψ(h, g) = \sum_{d=1}^{g-1} \frac{d}{2} \left(1 - \frac{d}{g}\right)^{h-1}.


► 43. [25] Exercise 33 uses a sentinel to speed up Algorithm S, by making the test
"i > 0" unnecessary in step S4. This trick does not apply to Algorithm D. Nevertheless,
show that there is an easy way to avoid testing "i > 0" in step D5, thereby speeding
up the inner loop of shellsort.

44. [M25] If π = a_1 ... a_n and π′ = a′_1 ... a′_n are permutations of {1, ..., n}, say that
π ⪯ π′ if the ith-largest element of {a_1, ..., a_j} is less than or equal to the ith-largest
element of {a′_1, ..., a′_j}, for 1 ≤ i ≤ j ≤ n. (In other words, π ⪯ π′ if straight insertion
sorting of π is componentwise less than or equal to straight insertion sorting of π′ after
the first j elements have been inserted, for all j.)

a) If π is above π′ in the sense of exercise 5.1.1-12, does it follow that π ⪯ π′?

b) If π ⪯ π′, does it follow that π^R ⪰ π′^R?

c) If π ⪯ π′, does it follow that π is above π′?


5.2.2. Sorting by Exchanging

We come now to the second family of sorting algorithms mentioned near the
beginning of Section 5.2: "exchange" or "transposition" methods that system-
atically interchange pairs of elements that are out of order until no more such
pairs exist.

The process of straight insertion, Algorithm 5.2.1S, can be viewed as an
exchange method: We take each new record R_j and essentially exchange it with
its neighbors to the left until it has been inserted into the proper place. Thus
the classification of sorting methods into various families such as "insertion,"




Fig. 14. The bubble sort in action. 

"exchange," "selection," etc., is not always clear-cut. In this section, we shall
discuss four types of sorting methods for which exchanging is a dominant char-
acteristic: exchange selection (the "bubble sort"); merge exchange (Batcher's
parallel sort); partition exchange (Hoare's "quicksort"); and radix exchange.

The bubble sort. Perhaps the most obvious way to sort by exchanges is to
compare K_1 with K_2, interchanging R_1 and R_2 if the keys are out of order;
then do the same to records R_2 and R_3, R_3 and R_4, etc. During this sequence
of operations, records with large keys tend to move to the right, and in fact
the record with the largest key will move up to become R_N. Repetitions of the
process will get the appropriate records into positions R_{N-1}, R_{N-2}, etc., so that
all records will ultimately be sorted.

Figure 14 shows this sorting method in action on the sixteen keys 503 087
512 ... 703; it is convenient to represent the file of numbers vertically instead of
horizontally, with R_N at the top and R_1 at the bottom. The method is called
"bubble sorting" because large elements "bubble up" to their proper position,
by contrast with the "sinking sort" (that is, straight insertion) in which elements
sink down to an appropriate level. The bubble sort is also known by more prosaic
names such as "exchange selection" or "propagation."

After each pass through the file, it is not hard to see that all records above 
and including the last one to be exchanged must be in their final position, so 


5.2.2 


SORTING BY EXCHANGING 107 


they need not be examined on subsequent passes. Horizontal lines in Fig. 14 
show the progress of the sorting from this standpoint; notice, for example, that 
five more elements are known to be in final position as a result of Pass 4. On 
the final pass, no exchanges are performed at all. With these observations we 
are ready to formulate the algorithm. 

Algorithm B ( Bubble sort). Records R \ . . . . , frfy are rearranged in place; after 
sorting is complete their keys will be in order, K\ < • • • < K n . 

Bl. [Initialize BOUND.] Set BOUND <— N. (BOUND is the highest index for which 
the record is not known to be in its final position; thus we are indicating 
that nothing is known at this point.) 

B2. [Loop on j.] Set t t— 0. Perform step B3 for j = 1,2, . . . , BOUND — 1, and 
then go to step B4. (If BOUND = 1, this means go directly to B4.) 

B3. [Compare/exchange Rj:R j+1 .} If Kj > K j+i , interchange Rj ++ R :l + i and 
set t j. 

B4. [Any exchanges?] If t = 0, terminate the algorithm. Otherwise set BOUND t 
and return to step B2. | 



Fig. 15. Flow chart for bubble sorting. 


Program B ( Bubble sort). As in previous MIX programs of this chapter, we 
assume that the items to be sorted are in locations INPUT+1 through INPUT+N. 


rll 

= t; rI2 

= j- 




01 

START 

ENT1 

N 

1 

Bl. Initialize BOUND, t <— N. 

02 

1H 

ST1 

BOUND (1 : 2) 

A 

BOUND <- t. 

03 


ENT2 

1 

A 

B2. Loop on i. i «— 1. 

04 


ENT1 

0 

A 

t i — 0. 

05 


JMP 

BOUND 

A 

Exit if j > BOUND. 

06 

3H 

LDA 

INPUT ,2 

C 

B3. Co in pare /ex chan ge R, : R, + i . 

07 


CMP A 

INPUT+1 ,2 

C 


08 


JLE 

2F 

c 

No exchange if Kj < Kj+ 1 . 

09 


LDX 

INPUT+1 ,2 

B 

Rj+i 

10 


STX 

INPUT ,2 

B 

► Rj- 

11 


STA 

INPUT+1, 2 

B 

(old Rj) — » Rj+i. 

12 


ENT1 

0,2 

B 

t <r- j. 

13 

2H 

INC2 

1 

C 

3 +“ j + 1. 

14 

BOUND 

ENTX 

-* ,2 

A + C 

rX 4— j — BOUND. [Instruction modified] 

15 


JXN 

3B 

A + C 

Do step B3 for 1 < j < BOUND. 

16 

4H 

J1P 

IB 

A 

B4. Any exchanges? To B2 if t > 0. 1 




108 SORTING 


5.2.2 


Analysis of the bubble sort. It is quite instructive to analyze the running 
time of Algorithm B. Three quantities are involved in the timing: the number 
of passes, A; the number of exchanges, B; and the number of comparisons, C. If 
the input keys are distinct and in random order, we may assume that they form 
a random permutation of {l,2,...,n}. The idea of inversion tables (Section 
5.1.1) leads to an easy way to describe the effect of each pass in a bubble sort. 

Theorem I. Let ai a 2 . ■ . a n be a permutation of { 1,2,..., n}, and let bi b 2 . . . b n 
be the corresponding inversion table. If one pass of the bubble sort, Algorithm B, 
changes a\ ■ ■ . a n to the permutation a[ a ' 2 . . . a' n , the corresponding inversion 
table b[ b' 2 ... b' n is obtained from Ip b 2 . . . b n by decreasing each nonzero entry 
by 1. 

Proof. If ai is preceded by a larger element, the largest preceding element is 
exchanged with it, so 6 a . decreases by 1. But if a, is not preceded by a larger 
element, it is never exchanged with a larger element, so b a . remains 0. | 

Thus we can see what happens during a bubble sort by studying the sequence 
of inversion tables between passes. For example, the successive inversion tables 
corresponding to Fig. 14 are 

p 3183450403223210 

2072340302112100 

I — y QQQ O / \ 

1061230201001000 

P^gg ^ 

0050120100000000 

and so on. If 6j b 2 . . . b n is the inversion table of the input permutation, we must 
therefore have 


A= 1 + max {bi,b 2 , ■ ■ ■ ,b n ), 

0) 

B — b\ + + • • ■ + b n , 

(3) 

C = Cl + C2 + • • • + Ca, 

(4) 


where Cj is the value of BOUND — 1 at the beginning of pass j. In terms of the 
inversion table, 

c j = max {bi + i | bi > j - 1} - j ( 5 ) 

(see exercise 5). In example (i) we therefore have A = 9, B = 41, C = 15 + 14 + 
13+12 + 7 + 5 + 4 + 3 + 2 = 75. The total MIX sorting time for Fig. 14 is 960u. 

The distribution of B (the total number of inversions in a random permu- 
tation) is very well-known to us by now; so we are left with A and C to be 
analyzed. 

The probability that A < k is 1/n! times the number of inversion tables 
having no components > k, namely k n ~ k k\, when 1 < k < n. Hence the 
probability that exactly k passes are required is 

A k = 1 ( k n ~ k kl - (k - 1 ) n ~ k+1 {k - 1 )! ). 


( 6 ) 


5.2.2 


SORTING BY EXCHANGING 


109 


The mean value kAk can now be calculated; summing by parts, it is 


k n ~ k k\ . 

= n + 1 - 2_^ — — — = n + 1 - P(n), 


(7) 


fc =0 


where P(n) is the function whose asymptotic value was found to be ^tvti/2 — | + 
0(l/y/n) in Eq. 1 . 2 . 11 . 3 -( 24 ). Formula ( 7 ) was stated without proof by E. H. 
Friend in JACM 3 (1956), 150; a proof was given by Howard B. Demuth [Ph.D. 
Thesis (Stanford University, October 1956), 64-68], For the standard deviation 
of A, see exercise 7. 

The total number of comparisons, C, is somewhat harder to handle, and we 
will consider only C ave . For fixed n, let fj(k) be the number of inversion tables 
bi . . . b n such that for 1 < i < n we have either 6, < j — 1 or b, + i — j < k; then 

fj(k) = ( j + k)\ ( j - l) n ~ 3 ~ k , for 0 < k < n - j. ( 8 ) 


(See exercise 8 .) The average value of Cj in ( 5 ) is ()C k(fj(k) — fj(k — l)))/n!; 
summing by parts and then summing on j leads to the formula 





n! 


E m = { 


1 <j<n 
0 <k<n—j 


n + 1 
2 


1 

n! 


E s!r "“ s - 

0 <r<s<n 


(9) 


Here the asymptotic value is not easy to determine, and we shall return to it at 
the end of this section. 

To summarize our analysis of the bubble sort, the formulas derived above 
and below may be written as follows: 

A — (min 1, ave N — \J ttN/2 + 0(1), max TV); ( 10 ) 

B = (min 0, ave j(1V 2 - N), max |(77 2 - IV )); ( 11 ) 


C= (min N - 1, ave \ (N 2 - N In iV — (7 + In 2 - 1)1V) +0(v / lV), 

max |(iV 2 - N)). ( 

12 ) 

In each case the minimum occurs when the input is already in order, and the 
maximum occurs when it is in reverse order; so the MIX running time is 8 d. + 
75 + 80+1 = (min 8IV+1, ave 5.751V 2 + 0(N log N), max 7.51V 2 + 0.5IV + l) . 

Refinements of the bubble sort. It took a good deal of work to analyze the 
bubble sort; and although the techniques used in the calculations are instructive, 
the results are disappointing since they tell us that the bubble sort isn’t really 
very good at all. Compared to straight insertion (Algorithm 5. 2. IS), bubble 
sorting requires a more complicated program and takes more than twice as long! 

Some of the bubble sort’s deficiencies are easy to spot. For example, in 
Fig. 14, the first comparison in Pass 4 is redundant, as are the first two in 
Pass 5 and the first three in Passes 6 and 7. Notice also that elements can never 
move to the left more than one step per pass; so if the smallest item happens 
to be initially at the far right we are forced to make the maximum number of 


110 SORTING 


5.2.2 


703 

o 

908 

908 

908 

908 

908 

908 

908 

765 ° 

7031 % 

765 

o 

897 

897 

897 

897 

897 

677 ° 

o 

765 °° 

703 ° 

765 

765 

765 

765 

765 

612 ° 
o 

677 

677 ° 

O 

703 

703 

703 

703 

703 

509 ° 

O 

612 

612 ° 
o 

677 

677 

677 

677 

677 

154 o 

509 

509 « 

O 

612 

612 

653 

653 

653 

426 0 ° 

154% 

O 

426 = 

509 

509 o° 

o 

612 

612 

612 

653 ° 

o 

426 °° 

O 

653 : 

O 

426 

653 J 

509 

512 

512 

275 ° 

O 

653 : 

275 ° 

o 

653 °° 

426 

o 

512 °° 

509 

509 

897 ° 

275 ° 

897 § 

275 

512 y 

426 »%), 

O, 

503 

503 

O O 

O 

897 o 

170 ° 

512 °° 

275 0 

503 ° 

426 

426 

908 / 

170 ° 

o 

512«A°° 

170°% % 

503 y 

275 

275 

275 

061 

O 

o 

512 0 

154 o 

503 °° 

170 

170 

170 

170 

512 

061 <*), 
o 

503 y 

154 

154 

154 

154 

154 

087 G 

503 \ 

087 

087 

087 

087 

087 

087 

503 J 

087 ° 

061 

061 

061 

061 

061 

061 


Fig. 16. The cocktail-shaker short [shic]. 


comparisons. This suggests the “cocktail-shaker sort,” in which alternate passes 
go in opposite directions (see Fig. 16). The average number of comparisons is 
slightly reduced by this approach. K. E. Iverson [A Programming Language 
(Wiley, 1962), 218-219] made an interesting observation in this regard: If j is 
an index such that R } and Rj+i are not exchanged with each other on two 
consecutive passes in opposite directions, then Rj and R j+1 must be in their 
final position, and they need not enter into any subsequent comparisons. For 
example, traversing 432186975 from left to right yields 32146875 9: 
no interchange occurred between R A and R$. When we traverse the latter 
permutation from right to left, we find R A still less than (the new) R 5 , so we 
may immediately conclude that R 4 and R 5 need not participate in any further 
comparisons. 

But none of these refinements lead to an algorithm better than straight 
insertion; and we already know that straight insertion isn’t suitable for large N. 
Another idea is to eliminate most of the exchanges; since most elements simply 
shift left one step during an exchange, we could achieve the same effect by viewing 
the array differently, shifting the origin of indexing! But the resulting algorithm 
is no better than straight selection, Algorithm 5.2.3S, which we shall study later. 

In short, the bubble sort seems to have nothing to recommend it, except a 
catchy name and the fact that it leads to some interesting theoretical problems. 

Batcher s parallel method. If we are going to have an exchange algorithm 
whose running time is faster than order N 2 , we need to select some nonadjacent 
pairs of keys (A), I\ , ) for comparisons; otherwise we will need as many exchanges 


5.2.2 


SORTING BY EXCHANGING 111 


as the original permutation has inversions, and the average number of inversions 
is |(A r2 — N). An ingenious way to program a sequence of comparisons, looking 
for potential exchanges, was discovered in 1964 by K. E. Batcher [see Proc. 
AFIPS Spring Joint Computer Conference 32 (1968), 307 314]. His method is 
not at all obvious; in fact, a fairly intricate proof is needed just to show that it 
is valid, since comparatively few comparisons are made. We shall discuss two 
proofs, one in this section and another in Section 5.3.4. 



Fig. 17. Algorithm M. 


Batcher’s sorting scheme is similar to shellsort, but the comparisons are 
done in a novel way so that no propagation of exchanges is necessary. We can, 
for instance, compare Table 1 (on the next page) to Table 5.2. 1-3; Batcher’s 
method achieves the effect of 8-sorting, 4-sorting, 2-sorting, and 1-sorting, but 
the comparisons do not overlap. Since Batcher’s algorithm essentially merges 
pairs of sorted subsequences, it may be called the “merge exchange sort.” 

Algorithm M ( Merge exchange). Records R\, . . . , Rjsr are rearranged in place; 
after sorting is complete their keys will be in order, Ki < ■ ■ • < Kpf. We assume 
that N > 2. 

Ml. [Initialize p.} Set p <— 2 t_1 , where t = [IgA] is the least integer such that 
2* > N. (Steps M2 through M5 will be performed for p = 2 <_1 , 2 t “ 2 , . . . , 1.) 
M2. [Initialize g, r, d.] Set q 2 t ~ 1 , r <— 0, d 4— p. 

M3. [Loop on i.] For all i such that 0 < i < N — d and i & p = r, do step M4. 
Then go to step M5. (Here i p means the “bitwise and” of the binary 
representations of i and p\ each bit of the result is zero except where both 
i and p have 1-bits in corresponding positions. Thus 13 & 21 = (1101)2 & 
(10101)2 = (00101)2 = 5. At this point, d is an odd multiple of p, and p is a 
power of 2, so that ik.p / (i + d) &p; it follows that the actions of step M4 
can be done for all relevant i in any order, even simultaneously.) 

M4. [Compare/exchange Ri + i : Ri + d+i] If K t+ i > Ki + d+ 1 , interchange the 
records R l+X o R i+d +i- 

M5. [Loop on q .] If q ^ p, set d 4- q — p, q 4 — q/2, r p, and return to M3. 
M6. [Loop on p.] (At this point the permutation Ki K 2 ■ ■ . K jv is p-ordered.) 
Set p [p/2j. If p > 0, go back to M2. | 








112 SORTING 


5.2.2 


Table 1 

MERGE-EXCHANGE SORTING (BATCHER’S METHOD) 


p q r d 


503 087 512 061 908 170 897 275 653 426 154 509 612 677 765 703 


8 8 0 8 


503 087 154 061 612 170 765 275 653 426 512 509 908 677 897 703 


4 8 0 4 


503 087 154 061 612 170 765 275 653 426 512 509 908 677 897 703 


4 4 4 4 


503 087 154 061 612 170 512 275 653 426 765 509 908 677 897 703 


2 8 0 2 


154 061 503 087 512 170 612 275 653 426 765 509 897 677 908 703 

154 061 503 087 512 170 612 275 653 426 765 509 897 677 908 703 

2 2 2 2 

154 061 503 087 512 170 612 275 653 426 765 509 897 677 908 703 

1801 wwwwwwww 

061 154 087 503 170 512 275 612 426 653 509 765 677 897 703 908 

061 154 087 503 170 512 275 612 426 653 509 765 677 897 703 908 

12 13 

061 154 087 275 170 426 503 509 512 653 612 703 677 897 765 908 

1111 W W W W W W W 

061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908 


Table 1 illustrates the method for N = 16. Notice that the algorithm sorts N 
elements essentially by sorting R lt R 3 , R 5 , . . . and R 2 ,R 4 ,R 6 ,. . . independently; 
then we perform steps M2 through M5 for p = 1, in order to merge the two 
sorted sequences together. 

In order to prove that the magic sequence of comparison/exchanges specified 
in Algorithm M actually will sort all possible input files R 4 R 2 . . . Rn, we must 
show only that steps M2 through M5 will merge all 2-ordered files f?i R 2 . . . R N 
when p = 1. For this purpose we can use the lattice-path method of Section 
5.2.1 (see Fig. 11 on page 87); each 2-ordered permutation of {1,2,..., N} 
corresponds uniquely to a path from (0,0) to ([7/V/2], (1V/2J) in a lattice di- 
agram. Figure 18(a) shows an example for N = 16, corresponding to the 
permutation 1 3 2 4 10 5 11 6 13 7 14 8 15 9 16 12. When we perform step M3 with 
P — lj q = 2 t_1 , r = 0, d — 1, the effect is to compare (and possibly exchange) 
Ri'.R'i, A 3 : Ri- etc. This operation corresponds to a simple transformation of 
the lattice path, “folding” it about the diagonal if necessary so that it never 
goes above the diagonal. (See Fig. 18(b) and the proof in exercise 10.) The 


5.2.2 


SORTING BY EXCHANGING 


113 


next iterations of step M3 have p = r = 1 , and d = 2 t ~ 1 — 1, 2*~ 2 — 1 , . . . , 1; 
their effect is to compare/exchange R 2 :R 2+ d, Ri'-R^+d, etc., and again there 
is a simple lattice interpretation: The path is “folded” about a line \{d + 1) 
units below the diagonal. See Fig. 18(c) and (d); eventually we get to the 
path in Fig. 18(e), which corresponds to a completely sorted permutation. This 
completes a “geometric proof” that Batcher’s algorithm is valid; we might call 
it sorting by folding! 



Fig. 18. A geometric interpretation of Batcher’s method, N = 16. 


A MIX program for Algorithm M appears in exercise 12. Unfortunately the 
amount of bookkeeping needed to control the sequence of comparisons is rather 
large, so the program is less efficient than other methods we have seen. But it has 
one important redeeming feature: All comparison/exchanges specified by a given 
iteration of step M3 can be done simultaneously , on computers or networks that 
allow parallel computations. With such parallel operations, sorting is completed 
in 2 |"lgiV"| (fig N~\ + 1) steps, and this is about as fast as any general method 
known. For example, 1024 elements can be sorted in only 55 parallel steps by 
Batcher’s method. The nearest competitor is Pratt’s method (see exercise 5.2.1- 
30), which uses either 40 or 73 steps, depending on how we count; if we are 
willing to allow overlapping comparisons as long as no overlapping exchanges 
are necessary, Pratt’s method requires only 40 comparison/exchange cycles to 
sort 1024 elements. For further comments, see Section 5.3.4. 

Quicksort. The sequence of comparisons in Batcher’s method is predetermined; 
we compare the same pairs of keys each time, regardless of what we may have 
learned about the file from previous comparisons. The same is largely true of the 
bubble sort, although Algorithm B does make limited use of previous knowledge 
in order to reduce its work at the right end of the file. Let us now turn to a 
quite different strategy, which uses the result of each comparison to determine 
what keys are to be compared next. Such a strategy is inappropriate for parallel 
computations, but on computers that work serially it can be quite fruitful. 

The basic idea of the following method is to take one record, say Ri , and to 
move it to the final position that it should occupy in the sorted file, say position s. 
While determining this final position, we will also rearrange the other records so 
that there will be none with greater keys to the left of position s, and none with 
smaller keys to the right. Thus the file will have been partitioned in such a way 



114 SORTING 


5.2.2 


that the original sorting problem is reduced to two simpler problems, namely 
to sort i?i . . . R s -i and (independently) to sort R s +i - ■ ■ Rn ■ We can apply the 
same technique to each of these subfiles, until the job is done. 

There are several ways to achieve such a partitioning into left and right 
subfiles; the following scheme due to R. Sedgewick seems to be best, for reasons 
that will become clearer when we analyze the algorithm: Keep two pointers, 
i and j, with i = 2 and j — N initially. If R, is eventually supposed to be 
part of the left-hand subfile after partitioning (we can tell this by comparing 
Ki with A'j), increase i by 1, and continue until encountering a record Fi, that 
belongs to the right-hand subfile. Similarly, decrease j by 1 until encountering 
a record Rj belonging to the left-hand subfile. If i < j, exchange H, with R :l : 
then move on to process the next records in the same way, “burning the candle 
at both ends” until i > j. The partitioning is finally completed by exchanging 
Rj with R i. For example, consider what happens to our file of sixteen numbers: 


Initial file: 

[503 

087 

512 

061 

908 

170 

897 

275 

653 

426 

154 

509 

612 

677 

765 

703 ] 

1st exchange: 

503 

087 

512 

061 

908 

170 

897 

275 

653 

426 

154 

509 

612 

677 

765 

703 

2nd exchange: 

503 

087 

154 

061 

908 

170 

897 

275 

653 

426 

512 

509 

612 

677 

765 

703 

3rd exchange: 

503 

087 

154 

061 

426 

170 

897 

275 

653 

908 

512 

509 

612 

677 

765 

703 

Pointers cross: 

503 

087 

154 

061 

426 

170 

275 

897 

653 

908 

512 

509 

612 

677 

765 

703 

Partitioned file: 

[275 

087 

154 

061 

426 

170] 

503 

[897 

653 

908 

512 

509 

612 

677 

765 

703] 


t t 
j i 


(In order to indicate the positions of i and j, keys K t and Kj are shown here in 
boldface type.) 

Table 2 shows how our example file gets completely sorted by this approach, 
in 11 stages. Brackets indicate subfiles that still need to be sorted; double 
brackets identify the subfile of current interest. Inside a computer, the current 
subfile can be represented by boundary values (l,r), and the other subfiles by 
a stack of additional pairs ( lk,rk ). Whenever a file is subdivided, we put the 
longer subfile on the stack and commence work on the shorter one, until we reach 
trivially short files; this strategy guarantees that the stack will never contain 
more than lgTV entries (see exercise 20). 

The sorting procedure just described may be called partition- exchange sort- 
ing ; it is due to C. A. R. Hoare, whose interesting paper [Comp. J. 5 (1962), 
10-15] contains one of the most comprehensive accounts of a sorting method that 
has ever been published. Hoare dubbed his method “quicksort,” and that name 
is not inappropriate, since the inner loops of the computation are extremely fast 
on most computers. All comparisons during a given stage are made against the 
same key, so this key may be kept in a register. Only a single index needs to 
be changed between comparisons. Furthermore, the amount of data movement 


5.2.2 


SORTING BY EXCHANGING 115 


Table 2 

QUICKSORTING 


(Z,r) Stack 


[503 

087 

512 

061 

908 

170 

897 

275 

653 

426 

154 

509 

612 

677 

765 

703] 

(1-16) 


[275 

087 

154 

061 

426 

170] 

503 

[897 

653 

908 

512 

509 

612 

677 

765 

703] 

(1.6) 

(8,16) 

[170 

087 

154 

061] 

275 

426 

503 

[897 

653 

908 

512 

509 

612 

677 

765 

703] 

(1.4) 

(8,16) 

[061 

087 

154J 

170 

275 

426 

503 

[897 

653 

908 

512 

509 

612 

677 

765 

703] 

(1.3) 

(8,16) 

061 

[087 

154] 

170 

275 

426 

503 

[897 

653 

908 

512 

509 

612 

677 

765 

703] 

(2,3) 

(8,16) 

061 

087 

154 

170 

275 

426 

503 

[897 

653 

908 

512 

509 

612 

677 

765 

703] 

(8,16) 

— 

061 

087 

154 

170 

275 

426 

503 

[765 

653 

703 

512 

509 

612 

677] 

897 

908 

(8,14) 

— 

061 

087 

154 

170 

275 

426 

503 

[677 

653 

703 

512 

509 

612] 

765 

897 

908 

(8-13) 

— 

061 

087 

154 

170 

275 

426 

503 

[509 

653 

612 

512] 

677 

703 

765 

897 

908 

(8,11) 

— 

061 

087 

154 

170 

275 

426 

503 

509 

[653 

612 

512] 

677 

703 

765 

897 

908 

(9,11) 

— 

061 

087 

154 

170 

275 

426 

503 

509 

[512 

612] 

653 

677 

703 

765 

897 

908 

(9,10) 

— 

061 

087 

154 

170 

275 

426 

503 

509 

512 

612 

653 

677 

703 

765 

897 

908 

- 

- 


is quite reasonable; the computation in Table 2, for example, makes only 17 
exchanges. 

The bookkeeping required to control i, j, and the stack is not difficult, but 
it makes the quicksort partitioning procedure most suitable for fairly large N. 
Therefore the following algorithm uses another strategy after the subfiles have 
become short. 

Algorithm Q ( Quicksort ). Records R x , . . . , R jv are rearranged in place; after 
sorting is complete their keys will be in order, K x < ■ ■ ■ < K N . An auxiliary 
stack with at most [lg IV J entries is needed for temporary storage. This algorithm 
follows the quicksort partitioning procedure described in the text above, with 
slight modifications for extra efficiency: 

a) We assume the presence of artificial keys K 0 = — oo and Kn+i = +oc such 
that 

Ko < Ki < Abv+i for 1 < i < N. ( 13 ) 

(Equality is allowed.) 

b) Subfiles of M or fewer elements are left unsorted until the very end of the 
procedure; then a single pass of straight insertion is used to produce the final 
ordering. Here M > 1 is a parameter that should be chosen as described in 
the text below. (This idea, due to R. Sedgewick, saves some of the overhead 
that would be necessary if we applied straight insertion directly to each small 
subfile, unless locality of reference is significant.) 

c) Records with equal keys are exchanged, although it is not strictly necessary 
to do so. (This idea, due to R. C. Singleton, keeps the inner loops fast and 
helps to split subfiles nearly in half when equal elements are present; see 
exercise 18.) 

Ql. [Initialize.] If N < M , go to step Q9. Otherwise set the stack empty, and 
set l «— 1, r N. 


116 SORTING 


5.2.2 



Fig. 19. Partition-exchange sorting (quicksort). 

Q2. [Begin new stage.] (We now wish to sort the subfile Ri . . . R r : from the 
nature of the algorithm, we have r > l + M, and A)_ 1 < Ki < K r+ i for 
l < i < r.) Set i 4— l, j <— r + 1; and set K -f- Ki. (The text below discusses 
alternative choices for K that might be better.) 

Q3. [Compare Ki : K.} (At this point the file has been rearranged so that 

Kk < K for l — 1 < k < i, K < Kk for j < k < r + 1; (14) 

and l < i < j.) Increase i by 1; then if A', < AT, repeat this step. (Since 
Kj > K, the iteration must terminate with i < j.) 

Q4. [Compare K :Kj .] Decrease j by 1; then if K < Kj, repeat this step. (Since 
K > Ki-\, the iteration must terminate with j > i — 1.) 

Q5. [Test i:j.} (At this point, (14) holds except for k = i and k = j; also 

Ki > K > Kj , and r > j > i — 1 > l.) If j < i, interchange Ri Rj and 

go to step Q7. 

Q6. [Exchange.] Interchange R, t O Rj and go back to step Q3. 

Q7. [Put on stack.] (Now the subfile Ri ... Rj ... R r has been partitioned so 
that Kk < Kj for Z — 1 < k < j and Kj < Kk for j < k < r + 1.) If 
1 — j > j — l > M, insert (j’+l, r ) on top of the stack, set r «— j — 1, and go 
to Q2. If j — l > r — j > M, insert (l,j — 1) on top of the stack, set l <— j + 1, 
and go to Q2. (Each entry (a, b) on the stack is a request to sort the subfile 
R a ... Rb at some future time.) Otherwise if r — j > M > j — l, set l «— j + 1 
and go to Q2; or if j — l > M > r — j, set r <— j — 1 and go to Q2. 

Q8. [Take off stack.] If the stack is nonempty, remove its top entry (/'. r'), set 
l 4 — l', r <— r', and return to step Q2. 

Q9. [Straight insertion sort.] For j = 2, 3, . . . , A) if Ay_i > Kj do the following 

operations: Set K 4— Kj, R 4— Rj, i 4— j — 1; then set R, + 1 4 — R., and 

i 4— i — 1 one or more times until Ki < K; then set Ri+\ 4— R. (This 







5.2.2 


SORTING BY EXCHANGING 117 


is Algorithm 5. 2. IS, modified as suggested in exercise 5.2.1-10 and answer 
5.2.1-33. Step Q9 may be omitted if M = 1. Caution: The final straight 
insertion might conceal bugs in steps Q1-Q8; don’t trust an implementation 
just because it gives the correct answers!) | 

The corresponding MIX program is rather long, but not complicated; in fact, 
a large part of the coding is devoted to step Q7, which just fools around with 
the variables in a very straightforward way. 

Program Q (Quicksort). Records to be sorted appear in locations INPUT+1 
through INPUT+N; assume that locations INPUT and INPUT+N+1 contain, respec- 
tively, the smallest and largest values possible in MIX. The stack is kept in 
locations STACK+1, STACK+2, . . . ; see exercise 20 for the exact number of locations 
to set aside for the stack. rI2 = l, rI3 = r, rI4 = i, rI5 = j, rI6 = size of stack, 


rA 

= K = 

R. We 

i assume that N > M. 



A 

EQU 

2:3 


First component of stack entry. 


B 

EQU 

4:5 


Second component of stack entry. 

01 

START 

ENT6 

0 

1 

Ql. Initialize. Set stack empty. 

02 


ENT2 

1 

1 

1 <- 1. 

03 


ENT3 

N 

1 

r <— N. 

04 

2H 

ENT5 

1,3 

A 

0.2. Begin new stage, i «— r + 1. 

05 


LDA 

INPUT ,2 

A 

K 4 - Ki. 

06 


ENT4 

1,2 

A 

i 4 — / T 1 . 

07 


JMP 

OF 

A 

To Q3 omitting “i «— i + 1” . 

08 

6H 

LDX 

INPUT, 4 

B 

06. Exchange. 

09 


ENT1 

INPUT ,4 

B 


10 


MOVE 

INPUT, 5 

B 


11 


STX 

INPUT, 5 

B 

Ri ^ Rj . 

12 

3H 

INC4 

1 

C' - A 

Q3. Compare Kn K. i i + 1. 

13 

OH 

CMPA 

INPUT, 4 

C' 


14 


JG 

3B 

C' 

Repeat if A > Ki. 

15 

4H 

DEC5 

1 

C-C 

04. Compare K : K , . i +- ? — 1. 

16 


CMPA 

INPUT, 5 

C-C 


17 


JL 

4B 

C-C' 

Repeat if K < Kj. 

18 

5H 

ENTX 

0,5 

B + A 

0.5. Test i : i. 

19 


DECX 

0,4 

B + A 


20 


JXP 

6B 

B + A 

To Q6 if j > i. 

21 


LDX 

INPUT, 5 

A 


22 


STX 

INPUT, 2 

A 

Ri 4— Rj • 

23 


STA 

INPUT , 5 

A 

Rj i — R. 

24 

7H 

ENT4 

0,3 

A 

Q7. Put on stack. 

25 


DEC4 

M, 5 

A 

rI4 r — j — M. 

26 


ENT1 

0,5 

A 


27 


DEC1 

M, 2 

A 

rll 4— j — l — M. 

28 


ENTA 

0,4 

A 


29 


DECA 

0,1 

A 


30 


JANN 

IF 

A 

Jump if r — j > j — l. 

31 


J1NP 

8F 

A' 

To Q8 if M > j — l > r — j . 

32 


J4NP 

3F 

S' + A" 

Jump if j — l>M>r — j. 


118 SORTING 5.2.2 


33 


INC6 

1 

S' 

(Now j - l > r - j > M.) 

34 


ST2 

STACK, 6 (A) 

S' 


35 


ENTA 

-1,5 

S' 


36 


STA 

STACK, 6(B) 

S' 

( l , j — 1) => stack. 

37 

4H 

ENT2 

1,5 

S' + A'" 

li-j + l. 

38 


JMP 

2B 

S' + A'" 

To Q2. 

39 

1H 

J4NP 

8F 

A- A' 

To Q8 if M > r — j > j — l. 

40 


J1NP 

4B 

S - S' + A"' 

Jump if r — j > M > j — l. 

41 


INC6 

1 

S -S' 

(Now r — j > j — l > M.) 

42 


ST3 

STACK, 6(B) 

S - S' 


43 


ENTA 

1,5 

S- S' 


44 


STA 

STACK, 6 (A) 

S- S' 

(j+1, r ) =+ stack. 

45 

3H 

ENT3 

-1,5 

S - S' + A" 

r <r~ j — 1. 

46 


JMP 

2B 

S - S' + A" 

To Q2. 

47 

8H 

LD2 

STACK, 6 (A) 

5+1 

Q8. Take off stack. 

48 


LD3 

STACK, 6(B) 

5 + 1 


49 


DEC6 

1 

5+1 

( l , r) <+ stack. 

50 


J6NN 

2B 

5 + 1 

To Q2 if stack wasn’t empty. 

51 

9H 

ENT5 

2-N 

1 

Q9. Straight insertion sort, i <— 2 

52 

2H 

LDA 

INPUT+N.5 

N - 1 

K t — Kj, R t — Rj . 

53 


CMP A 

INPUT+N-1 , 5 

N — 1 

(In this loop, rI5 = j — N.) 

54 


JGE 

6F 

N - 1 

Jump if K > Kj-\. 

55 

3H 

ENT4 

N-1,5 

D 

i +- j ~ 1. 

56 

4H 

LDX 

INPUT, 4 

E 


57 


STX 

INPUT+1 ,4 

E 

Ri + l ^ Ri ■ 

58 


DEC4 

1 

E 

i «— i — 1. 

59 


CMPA 

INPUT, 4 

E 


60 


JL 

4B 

E 

Repeat if K < K, . 

61 

5H 

STA 

INPUT+1, 4 

D 

Ri + l i — R. 

62 

6H 

INC5 

1 

N- 1 


63 


J5NP 

2B 

N- 1 

2 < j < N. | 


Analysis of quicksort. The timing information shown with Program Q is not 
hard to derive using Kirchhoff’s conservation law (Section 1 . 3 . 3 ) and the fact 
that everything put onto the stack is eventually removed again. Kirchhoff’s law 
applied at Q2 also shows that 

A = 1 + (S" + A"') + (S-S' + A") + S = 2S + 1 + A" + A'", ( 15 ) 

hence the total running time comes to 

24A + 11 B + 4C + 3D + 8E + 7N + 95 units, 

where 

A = number of partitioning stages; 

B = number of exchanges in step Q 6 ; 

C = number of comparisons made while partitioning; 

D = number of times Kj _ j > Kj during straight insertion (step Q9); 

E = number of inversions removed by straight insertion; 

S = number of times an entry is put on the stack. ( 16 ) 


5.2.2 


SORTING BY EXCHANGING 119 


By analyzing these six quantities, we will be able to make an intelligent choice of 
the parameter M that specifies the “threshold” between straight insertion and 
partitioning. The analysis is particularly instructive because the algorithm is 
rather complex; the unraveling of this complexity makes a particularly good 
illustration of important techniques. However, nonmathematical readers are 
advised to skip to Eq. ( 25 ). 

As in most other analyses of this chapter, we shall assume that the keys to 
be sorted are distinct; exercise 18 indicates that equalities between keys do not 
seriously harm the efficiency of Algorithm Q, and in fact they seem to help it. 
Since the method depends only on the relative order of the keys, we may as well 
assume that they are simply {1,2,..., N} in some order. 

We can attack this problem by considering the behavior of the very first 
partitioning stage, which takes us to Q7 for the first time. Once this partitioning 
has been achieved, both of the subfiles Ri . . . Rj-i and Rj+i ■ . . Rn will be in 
random order if the original file was in random order, since the relative order of 
elements in these subfiles has no effect on the partitioning algorithm. Therefore 
the contribution of subsequent partitionings can be determined by induction 
on N. (This is an important observation, since some alternative algorithms that 
violate this property have turned out to be significantly slower; see Computing 
Surveys 6 (1974), 287-289.) 

Let s be the value of the first key, K\, and assume that exactly t of the first s 
keys {K \ , . . . , K s } are greater than s. (Remember that the keys being sorted are 
the integers {1,2,..., N}.) If s — 1, it is easy to see what happens during the 
first stage of partitioning: Step Q3 is performed once, step Q4 is performed N 
times, and then step Q5 takes us to Q7. So the contributions of the first stage in 
this case are A = 1, B = 0, C = N + 1. A similar but slightly more complicated 
argument when s > 1 (see exercise 21 ) shows that the contributions of the first 
stage to the total running time are, in general, 

A = 1, B = t, C = N + 1, for 1 < s < N. ( 17 ) 

To this we must add the contributions of the later stages, which sort subfiles of 
s — 1 and N — s elements, respectively. 

If we assume that the original file is in random order, it is now possible 
to write down formulas that define the generating functions for the probability 
distributions of A, B, . . . , S (see exercise 22). But for simplicity we shall consider 
here only the average values of these quantities, An, Bn, ■ ■ ■ , Sn, as functions 
of N. Consider, for example, the average number of comparisons, Cn, that occur 
during the partitioning process. When N < M, Cn = 0. Otherwise, since any 
given value of s occurs with probability 1/N, we have 

1 N 

Cn = (N + 1 + (Vi + Cjv-s) 

S = 1 

= N+1 + ^ E for N > M. ( 18 ) 

0 <k<N 


120 SORTING 


5 . 2.2 


Similar formulas hold for other quantities A N , B N , D N , E N , S N (see exercise 23 ). 
There is a simple way to solve recurrence relations of the form 

2 


— fn + ^ 

n. 




for n > 


(19) 


0 <fc<n 


The first step is to get rid of the summation sign: Since 

(n+ l)x n+1 = (n+ l)/ n+1 + 2 ^2 x k , 

0 </c<n 


nx n 


7lfn “b 2 ^ ^ X/j, 

0 <k<n 

we may subtract, obtaining 

(n + l)x n+1 - nx„ - g n + 2 x n , where g n = (n + 1 )/ n+1 - n/„ 
Now the recurrence takes the much simpler form 

(n + l)x n+ i = (n + 2)x„ + g n , for n>m. 

Any recurrence relation that has the general form 

1 — b n X n T gn 


( 20 ) 


(2l) 


can be reduced to a summation if we multiply both sides by the “summation 
factor” cio a\ . . . a n -i/bo bi . . . b n ; we obtain 


■ i • • • &n— 1 

2/n+i = Vn + c„, where y n = — ; x„ 


do • • • ®n— 1 / \ 

C n = — 7 — (22) 


bo...b n - 1 6„ 61 . . . 6 n 

In our case (20), the summation factor is simply n!/(n + 2 )! = l/(n + l)(n + 2 ), 
so we find that the simple relation 

(n + l)/„+i - nf n 


l n -\- 1 


n + 1 


for n > m. 


(23) 


(n + l)(n + 2) 

is a consequence of (19). 

For example, if we set /„ = 1 /n, we get the unexpected result x„/(n + 1 ) = 
x m l (m + 1 ) for all n > m. If we set /„ = n + 1 , we get 

x„/(n+ 1) = 2/(n + 1) + 2/nd h 2/(?n + 2) + x m /{m + 1) 

^ (Hn +1 Hm-\- 1 ) T x m / (m + 1 ) , 

for all n > m. Thus we obtain the solution to (18) by setting m — M + 1 and 
x n = 0 for n < M; the required formula is 

Cjv = (IV + 1) (2Hn + i — 2Hm+2 + 1) 

N+l' 


2 (N + 1 ) In 


M + 2 


for N > M. 


(24) 


Exercise 6. 2 . 2-8 proves that, when M — 1, the standard deviation of Cy is 
asymptotically ^ / (21 — 2 ir 2 )/3 N; this is reasonably small compared to (24). 


5.2.2 


SORTING BY EXCHANGING 121 


The other quantities can be found in a similar way (see exercise 23); when 
N > M we have 


A n =2(N+1)/{M + 2) - 1, 

B n = ±(N+ 1) (2 H n+1 - 2H m+2 + 1 - 6/(M + 2)) + |, 

Dn = (N + 1) (1 - 2H m+1 /(M + 2)) , 

E n = i (N + 1 )M(M - 1 )/(M + 2); 

S N = (N + 1)/(2M + 3) - 1, for TV > 2M+ 1. (25) 

The discussion above shows that it is possible to carry out an exact analysis 
of the average running time of a fairly complex program, by using techniques 
that we have previously applied only to simpler cases. 

Formulas (24) and (25) can be used to determine the best value of M on a 
particular computer. In Mix’s case, Program Q requires (35/3 )(N + l)H N+l + 
|(7V + l)f(M) - 34.5 units of time on the average, for N > 2 M + 1, where 


f{M) = 8 M 


70Hm+2 + 71-36 


Hm+i 
M + 2 


+ 


270 
M + 2 


54 

2M + 3 ' 


(26) 


We want to choose M so that f(M) is a minimum, and a simple computer 
calculation shows that M — 9 is best. The average running time of Program Q 
is approximately 11.667 (N + 1) In IV - 1.741V - 18.74 units when M — 9, for 
large N. 

So Program Q is quite fast, on the average, considering that it requires very 
little memory space. Its speed is primarily due to the fact that the inner loops, 
in steps Q3 and Q4, are extremely short — only three MIX instructions each (see 
lines 12-14 and 15-17). The number of exchanges, in step Q6, is only about 
1/6 of the number of comparisons in steps Q3 and Q4; hence we have saved a 
significant amount of time by not comparing i to j in the inner loops. 

But what is the worst case of Algorithm Q? Are there some inputs that it 
does not handle efficiently? The answer to this question is quite embarrassing: 
If the original file is already in order, with K± < K 2 < ■■ ■ < K N , each 
"partitioning” operation is almost useless, since it reduces the size of the subfile 
by only one element! So this situation (which ought to be easiest of all to sort) 
makes quicksort anything but quick; the sorting time becomes proportional to 
N 2 instead of IVlgJV. (See exercise 25.) Unlike the other sorting methods we 
have seen, Algorithm Q likes a disordered file. 

Hoare suggested two ways to remedy the situation, in his original paper, by 
choosing a better value of the test key K that governs the partitioning. One of 
his recommendations was to choose a random integer q between l and r in the 
last part of step Q2; we can change the instruction “A «— K” to 


K •<— Kq , R «— R q , Rq Ri , Ri <r- R (27) 

in that step. (The last assignment “A; <— R: is necessary; otherwise step Q4 
would stop with j — l — 1 when K is the smallest key of the subfile being 


122 SORTING 


5.2.2 


partitioned.) According to Eqs. ( 25 ), such random integers need to be calculated 
only 2 (TV + 1 )/ (M + 2) - 1 times on the average, so the additional running time 
is not substantial; and the random choice gives good protection against the 
occurrence of the worst case. Even a mildly random choice of q should be safe. 
Exercise 42 proves that, with truly random q , the probability of more than, say, 
20TVlnTV comparisons will surely be less than 10~ 8 . 

Hoare’s second suggestion was to look at a small sample of the file and to 
choose a median value of the sample. This approach was adopted by R. C. 
Singleton [CACM 12 (1969), 185-187], who suggested letting K q be the median 
of the three values 

K [(l+r)/2i , K r . ( 28 ) 

Singleton’s procedure cuts the number of comparisons down from 27V In TV to 
about -y TV In TV (see exercise 29). It can be shown that FJ\- is asymptotically 
Cn / 5 instead of Cjv / 6 in this case, so the median method slightly increases the 
amount of time spent in transferring the data; the total running time therefore 
decreases by roughly 8 percent. (See exercise 56 for a detailed analysis.) The 
worst case is still of order TV , but such slow behavior will hardly ever occur. 

W. D. Frazer and A. C. McKellar [JACM 17 (1970), 496-507] have suggested 
taking a much larger sample consisting of 2 k - 1 records, where k is chosen so 
that 2 k « TV/ In TV. The sample can be sorted by the usual quicksort method, 
then inserted among the remaining records by taking k passes over the hie 
(partitioning it into 2 k subfiles, bounded by the elements of the sample). Finally 
the subfiles are sorted. The average number of comparisons required by such 
a samplesort procedure is about the same as in Singleton’s median method, 
when TV is in a practical range, but it decreases to the asymptotic value TV lg TV 
as TV — > 00 . 

An absolute guarantee of 0(TV log TV) sorting time in the worst case, together 
with fast running time on the average, can be obtained by combining quicksort 
with other schemes. For example, D. R. Musser [Software Practice k Exper. 27 
(1997), 983-993] has suggested adding a “depth of partitioning” component to 
each entry on quicksort’s stack. If any subfile is found to have been subdivided 
more than, say, 2 lg TV times, we can abandon Algorithm Q and switch to Al- 
gorithm 5.2.3H. The inner loop time remains unchanged, so the average total 
running time remains almost the same as before. 

Robert Sedgewick has analyzed a number of optimized variants of quicksort 

in Acta Informatica 7 (1977), 327-356, and in CACM 21 (1978), 847 857. 

22 (1979), 368. See also J. L. Bentley and M. D. Mcllroy, Software Practice 
k Exper. 23 (1993), 1249-1265, for a version of quicksort that has been tuned 
up to fit the UNIX® software library, based on 15 further years of experience. 

Radix exchange. We come now to a method that is quite different from 
any of the sorting schemes we have seen before; it makes use of the binary 
representation of the keys, so it is intended only for binary computers. Instead 
of comparing two keys with each other, this method inspects individual bits of 


5.2.2 


SORTING BY EXCHANGING 123 


the keys, to see if they are 0 or 1. In other respects it has the characteristics of 
exchange sorting, and, in fact, it is rather similar to quicksort. Since it depends 
on radix 2 representations, we call it “radix exchange sorting.” The algorithm 
can be described roughly as follows: 

i) Sort the sequence on its most significant binary bit, so that all keys that 
have a leading 0 come before all keys that have a leading 1. This sorting is done 
by finding the leftmost key Ki that has a leading 1, and the rightmost key Kj 
with a leading 0. Then Ri and Rj are exchanged and the process is repeated 
until i > j. 

ii) Let F 0 be the elements with leading bit 0, and let F x be the others. Apply 
the radix exchange sorting method to F 0 (starting now at the second bit from 
the left instead of the most significant bit), until F 0 is completely sorted; then 
do the same for F\. 

For example, Table 3 shows how the radix exchange sort acts on our 16 
random numbers, which have been converted to octal notation. Stage 1 in the 
table shows the initial input, and after exchanging on the first, bit we get to 
stage 2. Stage 2 sorts the first group on bit 2, and stage 3 works on bit 3. (The 
reader should mentally convert the octal notation to 10-bit binary numbers. For 
example, 0232 stands for (0 010 011 010)2.) When we reach stage 5, after sorting 
on bit 4, we find that each group remaining has but a single element, so this part 
of the file need not be further examined. The notation “ 4 [0232 0252]” means 
that the subfile 0232 0252 is waiting to be sorted on bit 4 from the left. In this 
particular case, no progress occurs when sorting on bit 4; we need to go to bit 5 
before the items are separated. 

The complete sorting process shown in Table 3 takes 22 stages, somewhat 
more than the comparable number for quicksort (Table 2). Similarly, the number 
of bit inspections, 82, is rather high; but we shall see that the number of bit 
inspections for large N is actually less than the number of comparisons made 
by quicksort, assuming a uniform distribution of keys. The total number of 
exchanges in Table 3 is 17, which is quite reasonable. Note that bit inspections 
never have to go past bit 7 here, although 10-bit numbers are being sorted. 

As in quicksort, we can use a stack to keep track of the “boundary line 
information” for waiting subfiles. Instead of sorting the smallest subfile first, it 
is convenient simply to go from left to right, since the stack size in this case 
can never exceed the number of bits in the keys being sorted. In the following 
algorithm the stack entry (r, b) is used to indicate the right boundary r of a 
subfile waiting to be sorted on bit b; the left boundary need not actually be 
recorded in the stack — it is implicit because of the left-to-right nature of the 
procedure. 

Algorithm R ( Radix exchange sort). Records Ri,. ..,Rn are rearranged in 
place; after sorting is complete, their keys will be in order, K x < ■ ■ ■ < K N . Each 
key is assumed to be a nonnegative m-bit binary number, a 2 . . . a m ) 2 ; the ith 
most significant bit, oq, is called “bit i” of the key. An auxiliary stack with 
room for at most m — 1 entries is needed for temporary storage. This algorithm 


124 SORTING 


5.2.2 


cq cq cq cq cq cq <N cq 
co^ co~ co~ cd' co~ co~ <oT co~ <oT ccT 


CO fO M CO CO 
CO CO CD CO CD 


CO CO CO CO 

00 00 00 oo 


Hdcoi'^incoTfincoNcicO’i'Tfiococo^iocoh I 

COOO^fNTtiTfooOOOOOOOOCDTfOTfCOCOCOCOCOCOCO | 


COCOiOCONNNCDOl© 


Cl lO lO lO lO lO S 


ir — i: — t -T— ^ T— ( >—«, >-«< 1 — ^ •« — ( 4 -r— |T~H '■< 7 - 

pppppppppppt'*<0<0 < 0'0‘0C0<00>*0 < 0' , '~i 

Oj&JOl^SQOJOJ^^OjOjOjCoCo^CotoCoCoCo^toCo 
|C to to to IP to to to IP tp to IP 

PPPPPPPPPPPP»-('*-l>-l»'S>'H'*-(>-|'»-H>-H>-<<0 

GOCOGOCOCOGOCOCOGOGOCOGOCOCOCOCCICOCOCOCOCOCOCCI 

"» — I '•—H '*~~i '' — I — t ’’ — I ’’ — 1 ’’ — I ’’ — ) >-i _ >~l >~H >-< >— I >-H >-7 

CO CO CO CO CO “ ^ xT* ^ 

tototototototototototototototototototototototo 

P P P P P P P 

O}O}©QOJOjO}QQO)OjO}O}O}O}OiO)G0G0G0C0C0G0G0G0 

'*~-H ""““t ■*““( ’»~H — ( '' — < •» — 4 '< — I ■* — 4 ’^--I >-t >-H >"l »h •*-«< t-n T-s 

tO lO 

to to to P P P P P P 

»-i ^ ^P P P P P P 

tO<0 < OCO<OCOCO l OCOCO<OCOCOPPPPtototototo‘0 

pCOCO<OCO<OCOCOCO<OCOCOCOPPPp-^'v7.-v7-''7-'Nj»^- 

P00000000000001)0)^0)0)0)^0)CQ^ 

/*-( •» — ( ■» — 1 ■*--* -»~-S •« — ( ■< — ( 

CO 

<3Q to to to to to to to to to to 

0}COCOCOCCCOCOCOCOCOCOCOCOGOCO©QO}0}<3\)0}0)'^10} 

>-~| T-S T~H >~H »H >-| 

^ Tf lO 

-vj- -sf. ^ --^- -^7- 'sj- "••T- 

tlOOOOOC30QOOOPO'7NfNl'tNt''7'fN|-Nf' 
^!OtOlO'0C0!o(0C0'O!0^0)C'sN»i>i^N’>HN^ 

tOtotototOtototo'OlO^toiO'tOOOOOOOOO 

Cl0)0)0)0)©}a)5l0jCl^^^NQQ)OOC)OOQiO 

> ~l .’’"I . .’’-< . > ~H . .’’"S . ^ . *i >-< >-< >~i >-< ~*-t /’-H ^ 

CN (NCN(NCNCN(N(M(M(N 1 ^ ' 

Co Co CO^ Go 1 Go" Go" Cototototototototototototototototo 
OQOJOJOJOJOJOJPPPPPPPPPPPPPPPP 

'^■^‘'“^'^'^'^‘'^PPPPPPPPPPPPPPPP 


^•MOJOJOJOJOJOJPPPPPP 

Qjtototototototococococococo 

cocococococococopppppp 


ppppppppp 

cococococococococo 

ppppppppp 

CO<OCOCO<OCO<OCOCO 


P P Ol G\} Q\} 

CO Co to to to 

P P Co Co CO 

CO <0 CO <0 CO 


< 3 s) < 2 v) « 3 <» Q\J ©Q < 3 s} < 3 \) < 2 v) < 3 s} < 2 sJ 

totototototototototototo 

cocococococococococococo 

OOOOOOOQQJOOO 


to Co Go Co Co 

P O} O} CQ GQ 

p "■* '*■*■ '•* 

O C) Q o o 


cocococococococococococo 

CiCQCiOCjCiOQiCiQC) 


> 0 } < 0 J < 0 } O} < 0 } 

tO to to to to 

O} <3Q QQ <0} <3Q 

<O<O<O<O<O 


G}&)GjG)Gj 0 ) 0 )GlG} 0 )^^ 

totototototototototototo 

0 } 0 }< 3 \} 0 }< 2 v) 0 } 0 } 0 } 0 } 0 ) 0 }< 3 \} 

oooooooooooo 


Q>t 0 < 3 Q 0 )QQ 0 } 0 ) 0 } 0 j 

OPCoCoGOCoCoCoCo 

C 0 P 0 J 0 } 0 }QQ 0 } 0 ) 0 J 

NOOOOOOOO 


GQ CQ GQ O} O} C^ G>) 

Co CO CO Co CO Co Co 

G} GQ GQ G\} GQ GQ GQ 

o o o o o o o 


< 3 Q < 0 } OJ OJ < 0 } 

Co Co Co Co Co 

G) G) ^ O} G} 

C> CO <o CO CO 


ppppppppp 

^G)G)GJG)G)G)G)G) 


P P P P P P P 

GJ G) ^ G) Gj G) G) 


OOOCOOOOOOCOOOOOCOO 


p p p p p 

G} GQ G) G) 05 

> -'< >-l >~H >-1 >-H 

<0 CO CO Q) CO 


PP< 3 v}tototototolO 

cocotoPPPPPP 

PP<0J<O<O<O<O<O<O 

OO'OCOOO'OOO 


to to to to to to to 

p p p p p p p 

< 0 < 0 < 0 < 0 < 0 < 0<0 
CO CO CO CO CO CO CO 


to to to to to 

p p p p p 

CO CO CO CO co 

CO CO CO CO CO 


(MCOTjciOCOPOOOl 


(M CO Tf to CO 


The radix exchange method looks precisely once at every bit that is needed to determine the final order of the keys. 


5.2.2 


SORTING BY EXCHANGING 125 


essentially follows the radix exchange partitioning procedure described in the 
text above; certain improvements in its efficiency are possible, as described in 
the text and exercises below. 

Rl. [Initialize.] Set the stack empty, and set l «— 1, r «— IV, b 4— 1. 

R2. [Begin new stage.] (We now wish to sort the subfile Ri . . . R r on bit 6; 
from the nature of the algorithm, we have l < r.) If l — r, go to step RIO 
(since a one- word file is already sorted). Otherwise set i <— l, j r. 

R3. [Inspect Kj for 1.] Examine bit b of K t . If it is a 1, go to step R6. 

R4. [Increase *.] Increase i by 1. If i < j , return to step R3; otherwise go to 

step R.8. 

R5. [Inspect Kj + 1 for 0.] Examine bit b of Kj + If it is a 0, go to step R7. 

R6. [Decrease j.} Decrease j by 1. If i < j, go to step R5; otherwise go to 

step R8. 

R7. [Exchange Ri, Rj+i] Interchange records R, <-> Rj+i‘, then go to step R4. 

R8. [Test special cases.] (At this point a partitioning stage has been completed; 
i = j + 1, bit b of keys Ki, . . . ,Kj is 0, and bit b of keys Ki, ... , K r is 1.) 
Increase 6 by 1. If b > m, where m is the total number of bits in the keys, 
go to step RIO. (In such a case, the subfile R[ . . . R, has been sorted. This 
test need not be made if there is no chance of having equal keys present in 
the file.) Otherwise if j < l or j = r, go back to step R2 (all bits examined 
were 1 or 0, respectively). Otherwise if j — l, increase l by 1 and go to 
step R2 (there was only one 0 bit). 

R9. [Put on stack.] Insert the entry (r, b) on top of the stack; then set r j 
and go to step R2. 

RIO. [Take off stack.] If the stack is empty, we are done sorting; otherwise set 
l -f- r + 1, remove the top entry (r', b ') of the stack, set r <— r\ b b ' , and 
return to step R2. | 

Program R ( Radix exchange sort). The following MIX code uses essentially the 
same conventions as Program Q. We have rll = l — r, rI2 = r, rI3 = i , rI4 = j, 
rI5 = rn — b. rI6 = size of stack, except that it proves convenient for certain 
instructions (designated below) to leave rI3 = i — j or rI4 = j — i. Because of 
the binary nature of radix exchange, this program uses the operations SRB (shift 
right AX binary), JAE (jump A even), and JAO (jump A odd), defined in Section 
4.5.2. We assume that N > 2. 


01 

START ENT6 

0 

1 

Rl. Initialize. Set stack empty. 

02 

ENT1 

1-N 

1 

1 <r~ 1 . 

03 

ENT 2 

N 

1 

r <— N. 

04 

ENT5 

M-l 

1 

b<-l. 

05 

JMP 

IF 

1 

To R2 (omit testing l = r). 

06 

9H INC6 

1 

S 

R9. Put on stack. [rI4 

07 

ST2 

STACK, 6 (A) 

S 


08 

ST5 

STACK, 6(B) 

s 

(r, b) => stack. 


126 


SORTING 



5.2.2 

09 


ENN1 

0,4 

5 

rll «— l — j. 

10 


ENT2 

-1,3 

S 

r +- j. 

11 

1H 

ENT3 

0,1 

A 

R2. Begin new stage. Irl3 = i — 7 ] 

12 


ENT4 

0,2 

A 

i <— l, j <— r. [rI3 = i— j] 

13 

3H 

INC3 

0,4 

C' 

R3. Inspect Ki for 1. 

14 


LDA 

INPUT, 3 

C' 


15 


SEB 

0,5 

C' 

units bit of rA +- bit b of K t . 

16 


JAE 

4F 

C' 

To R4 if it is 0. 

17 

6H 

DEC4 

1,3 

C" + X 

R6. Decrease i. i <— 7 — 1. frI4= i — i] 

18 


J4N 

8F 

C" + X 

To R8 if j < i. [rI4 = j — i] 

19 

5H 

INC4 

0,3 

C" 

R5. Inspect Kj 4-1 for 0. 

20 


LDA 

INPUT+1,4 

C" 


21 


SRB 

0,5 

c" 

units bit of rA 4— bit b of Kj+ 1 . 

22 


JAO 

6B 

c" 

To R6 if it is 1. 

23 

7H 

LDA 

INPUT+1,4 

B 

R7. Exchange R , , 

24 


LDX 

INPUT, 3 

B 


25 


STX 

INPUT+1,4 

B 


26 


STA 

INPUT, 3 

B 


27 

4H 

DEC3 

-1,4 

C' -X 

R4. Increase i. i +- * + 1. [rI3 = i— ?| 

28 


J3NP 

3B 

C' -X 

To R3 if i < j. [rI3 = i—j] 

29 


INC3 

0,4 

A-X 

rI3 <— i. 

30 

8H 

J5Z 

OF 

A 

R8. Test special cases. [rI4 unknown] 

31 


DEC5 

1 

A-G 

To R10 if b = m, else £> + — 6+1. 

32 


ENT4 

-1,3 

A-G 

rI4 +- j. 

S3 


DEC4 

0,2 

A-G 

rI4 j — r. 

34 


J4Z 

IB 

A-G 

To R2 if j = r. 

35 


DEC4 

0,1 

A-G-R 

rI4 «— j — l. 

36 


J4N 

IB 

A-G- R 

To R2 if j < l. 

37 


J4NZ 

9B 

A-G-L-R 

To R9 if j / l. 

38 


INC1 

1 

K 

Z+-Z + 1. 

39 

2H 

J1NZ 

IB 

K + S 

Jump if Z / r. 

40 

OH 

ENT1 

1,2 

5 + 1 

R10. Take off stack. 

41 


LD2 

STACK, 6 (A) 

5+1 


42 


DEC1 

0,2 

5 + 1 


43 


LD5 

STACK, 6(B) 

5 + 1 

stack => ( r , b). 

44 


DEC6 

1 

5+1 


45 


J6NN 

2B 

5+1 

To R2 if stack was nonempty. | 


The running time of this radix exchange program depends on 


A = number of stages encountered with l < r; 

B = number of exchanges; 

C = C' + C" = number of bit inspections; 

G = number of times b > m in step R 8 ; 

K — number of times b < m, j — l in step R 8 ; ( 29 ) 

L = number of times b < m, j < l in step R 8 ; 

R — number of times b < m, j = r in step R 8 ; 

S = number of times things are entered onto the stack; 

X = number of times j < i in step R 6 . 


5.2.2 


SORTING BY EXCHANGING 127 


By Kirchhoff’s law, S — A — G — K — L — R , so the total running time comes to 
27 A + 8£? + 8 C — 22>G — 14 K — 17 L — 191? — X + 13 units. The bit-inspection 
loops can be made somewhat faster, as shown in exercise 34, at the expense of 
a more complicated program. It is also possible to increase the speed of radix 
exchange by using straight insertion whenever r — l is sufficiently small, as we 
did in Algorithm Q; but we shall not dwell on these refinements. 

In order to analyze the running time of radix exchange, two kinds of input 
data suggest themselves. We can 

i) assume that N = 2 m and that the keys to be sorted are simply the integers 
0, 1, 2, . . . , 2 m — 1 in random order; or 

ii) assume that m — oo (unlimited precision) and that the keys to be sorted 
are independent uniformly distributed real numbers in [0 . . 1). 

The analysis of case (i) is relatively easy, so it has been left as an exercise 
for the reader (see exercise 35). Case (ii) is comparatively difficult, so it has 
also been left as an exercise (see exercise 38). The following table shows crude 
approximations to the results of these analyses: 


Quantity 

Case (i) 

Case (ii) 

A 

N 

aN 

B 

\NlgN 

\NlgN 

C 

NlgN 

NlgN 

G 

-N 

2 1 

0 

K 

0 

I N 

2 1V 

L 

0 

±{a-l)N 

R 

0 

f(a- 1)N 

S 

-N 

2 iv 

-N 

X 

I N 

2 IV 

-N 

2 iV 


Here a = l/ln2 ss 1.4427. Notice that the average number of exchanges, bit 
inspections, and stack accesses is essentially the same for both kinds of data, 
even though case (ii) takes about 44 percent more stages. Our MIX program 
takes approximately 14.4 IV In N units of time, on the average, to sort N items 
in case (ii), and this could be cut to about 11.5 N In N using the suggestion of 
exercise 34; the corresponding figure for Program Q is 11.7 N In N, which can be 
decreased to about 10.6 N In iV using Singleton’s median-of-three suggestion. 

Thus radix exchange sorting takes about as long as quicksort, on the average, 
when sorting uniformly distributed data; on some machines it is actually a little 
quicker than quicksort. Exercise 53 indicates to what extent the process slows 
down for a nonuniform distribution. It is important to note that our entire 
analysis is predicated on the assumption that keys are distinct; radix exchange 
as defined above is not especially efficient when equal keys are present, since it 
goes through several time-consuming stages trying to separate sets of identical 


128 SORTING 


5.2.2 


keys before b becomes > m. One plausible way to remedy this defect is suggested 
in the answer to exercise 40. 

Both radix exchange and quicksort are essentially based on the idea of 
partitioning. Records are exchanged until the file is split into two parts: a left- 
hand subfile, in which all keys are < K. for some K , and a right-hand subfile 
in which all keys are > K . Quicksort chooses K to be an actual key in the 
file, while radix exchange essentially chooses an artificial key K based on binary 
representations. From a historical standpoint, radix exchange was discovered by 
P. Hildebrandt, H. Isbitz, H. Rising, and J. Schwartz [JACM 6 (1959), 156-163], 
about a year earlier than quicksort. Other partitioning schemes are also possible; 
for example, John McCarthy has suggested setting K « |(tt + v), if all keys are 
known to lie between u and v. Yihsiao Wang has suggested that the mean of 
three key values such as ( 28 ) be used as the threshold for partitioning; he has 
proved that the number of comparisons required to sort uniformly distributed 
random data will then be asymptotic to 1.082 NlgN. 

Still another partitioning strategy has been proposed by M. H. van Emden 
[CACM 13 (1970), 563-567]: Instead of choosing K in advance, we “learn” 
what a good K might be, by keeping track of K' = ma x(AT/, . . . , Ki) and K" — 
inin( Kj , . . . , K r ) as partitioning proceeds. We may increase 1 until encountering 
a key greater than K 1 , then decrease j until encountering a key less than K ", then 
exchange and/or adjust K' and K". Empirical tests on this “interval-exchange 
sort” method indicate that it is slightly slower than quicksort; its running time 
appears to be so difficult to analyze that an adequate theoretical explanation 
will never be found, especially since the subfiles after partitioning are no longer 
in random order. 

A generalization of radix exchange to radices higher than 2 is discussed in 
Section 5.2.5. 

* Asymptotic methods. The analysis of exchange sort ing algorithms leads to 
some particularly instructive mathematical problems that enable us to learn 
more about how to find the asymptotic behavior of functions. For example, we 
came across the function 

= ± £ «!r n - (31) 

0 <r<s<n 

in ( 9 ), during our analysis of the bubble sort; what is its asymptotic value? 

We can proceed as in our study of the number of involutions, Eq. 5 . 1 . 4 -( 4 i); 
the reader will find it helpful to review the discussion at the end of Section 5.1.4 
before reading further. 

Inspection of ( 31 ) shows that the contribution for s = n is larger than that 
for s = n — 1, etc.; this suggests replacing s by n - ,s. In fact, we soon discover 
that it is most convenient to use the substitutions t = n — s + l,m = n + l, so 
that ( 31 ) becomes 

— Wm-i = “ £ (m — 1 )\ £ r t_x . (32) 

m m\ ^ 

l<t<m 0 <r<m—t 


5.2.2 


SORTING BY EXCHANGING 129 


The inner sum has a well-known asymptotic series obtained from Euler’s sum- 
mation formula, namely 

E r<_1 = T ~\ {Nt _1 - 5n) + If (* - 1 )( iVt " 2 - s «) + ■■■ 

0<r<N 

1 k 

= ~ t E ( ? ) B ^ Nt ~ 3 ~ *ti) + 0{N t ~ k ) ( 33 ) 

3=0 J 

(see exercise 1.2.11.2-4); hence our problem reduces to studying sums of the 
form 

~j E - ty. (m - t)H k , k > — 1. (34) 

1 < £ < m 

As in Section 5.1.4 we can show that the value of this summand is negligi- 
ble, 0(exp(-n 5 )), whenever t is greater than m 1/2+e ; hence we may put t = 
0(to 1 / 2+c ) and replace the factorials by Stirling’s approximation: 


(m - t)! (m — ty 

ml 



We are therefore interested in the asymptotic value of 

n(m) = e~ t2/2m t k , k>- 1. (35) 

1 <t<m 

The sum could also be extended to the full range 1 < t < 00 without changing 
its asymptotic value, since the values for t > m 1/2+e are negligible. 

Let gk(x) = x k e~ x2 and f k (x) = gk(x/s/2m). When k > 0, Euler’s 
summation formula tells us that 

r m p R 

E /*«=/ AW^+E|(/r ) H-/l H 1 (o))+^ 

0 <t<m j = 1 

r B p ({x})fi p \x)dx 
P • Jo 

= (7b) 0 (/ (36) 

hence we can get an asymptotic series for r*,(m) whenever A: > 0 by using 
essentially the same ideas we have used before. But when k — -1 the method 
breaks down, since /-i(0) is undefined; we can’t merely sum from 1 to m either, 
because the remainders don’t give smaller and smaller powers of m when the 
lower limit is 1. (This is the crux of the matter, and the reader should pause to 
appreciate the problem before proceeding further.) 


130 SORTING 


5.2.2 


To resolve the dilemma we can define g_ t (x) = (e ~ x2 - l)/x and /_ 1 (x) = 
9 - 1 {x/%/ 2 m); then /_ i( 0 ) = 0 , and r_i(ra) can be obtained from 52 o<t<m /- i(t) 
in a simple way. Equation (36) is now valid for k — — 1, and the remaining 
integral is well known, 



= -7 - lnm + ln2 + 0 (e m / 2 ), 


by exercise 43 . 

Now we have enough facts and formulas to grind out the answer, 

W n = In to + i (7 + In 2)m — |\/ 27 rm+ §§ + 0 {nr 1/2 ), m = n + 1, (37) 

as shown in exercise 44 . This completes our analysis of the bubble sort. 

For the analysis of radix exchange sorting, we need to know the asymptotic 
value of the finite sum 

v - = E( n t )<-‘) t ¥ ^rj m 

as n — >- 00. This question turns out to be harder than any of the other asymptotic 
problems we have met so far; the elementary methods of power series expansions. 
Euler’s summation formula, etc., turn out to be inadequate. The following 
derivation has been suggested by N. G. de Bruijn. 

To get rid of the cancellation effects of the large factors (fc)(-l) fe in (38), 
we start by rewriting the sum as an infinite series 

U " = E(fc) (-D*£ (^n) J = E (2 J '(! - 2 - j r - 2 j +n). (39) 

k> 2 j> 1 j> 1 

If we set x = n/ 2 J , the summand is 

2*(l-2-'')"-2*+n = ^ + . 

When x < n e , we have 

( 1_ ^) = ex p( nln ( 1 “ ^)) =exp(-x + x 2 0 (n~ 1 )), (40) 

and this suggests approximating (39) by 

T n = J 2 ( 2 j e- n/2i ~ 2 j +n). (41) 

i> 1 


5.2.2 


SORTING BY EXCHANGING 131 


To justify this approximation, we have U n — T n = X n + Y n , where 


X n = ]T (2 j (l-2~ j ) n -Ve- n/2j ) 


J>i 

2 3 <n 1 


= E o( 


-n/2 3 


ne ) 


[the terms for x > n e ] 


[since 0 < 1 — 2 ? < e 


i> i 

2 J <n 1 ~ e 

0 (n log ne _n ) 


[since there are O(logn) terms]; 


and 


Y n = (2^(1 — 2 J ) n — 2 J e n / 2J [) [the terms for x < n e ] 


i>i 

2 3 >n 1 ~ 


= E ( e [by (40)]. 

2 3 >n 1 ~ e 


Our discussion below will demonstrate that the latter sum is 0(1); consequently 
U n — T n — 0(1). (See exercise 47.) 

So far we haven’t applied any techniques that are really different from those 
we have used before. But the study of T n requires a new idea, based on simple 
principles of complex variable theory: If x is any positive number, we have 

1 pl/ 2 -\-ioo 1 poo 

e~ x = — — : / Y(z)x~ z dz = — / T(i + dt. ( 42 ) 

Znl Jl/2-ioo J - 00 

To prove this identity, consider the path of integration shown in Fig. 20(a), where 
N, N\ and M are large. The value of the integral along this contour is the sum 
of the residues inside, namely 


E x ^ ^ bm (z + k)Y(z) 
(X k<M 


E xk 

0 <k<M 


k\ 


The integral on the top line is 0(f E |r(t + iN)\x 4 dt), and we have the well- 
known bound 


Y(t + iN) = 0(\t + tW| t-1/2 e _ * _,rAr/2 ) as N -> 00 . 

[For properties of the gamma function see, for example, Erdelyi, Magnus, Ober- 
hettinger, and Tricomi, Higher Transcendental Functions 1 (New York: McGraw- 
Hill, 1953), Chapter 1.] Therefore the top line integral is quite negligible, 
0(e -7rA / 2 f*^(N /xe ) 4 dt). The bottom line integral has a similar innocuous 
behavior. For the integral along the left line we use the fact that 

r(| -\-it- m) = r(| + + i + i*)...(-i + i + it) 

~ F(| + it)0(l/(M — 1)!); 


132 SORTING 


5.2.2 




(a) (b) 

Fig. 20. Contours of integration for gamma-function identities. 


hence the left-hand integral is 0(a; M - 1 / 2 /(M-l)!) |r(| + it) | dt. Therefore 

as M, N, N' — t oo, only the right-hand integral survives, and this proves ( 42 ). 
In fact, ( 42 ) remains valid if we replace i by any positive number. 

The same argument can be used to derive many other useful relations 
involving the gamma function. We can replace x~ z by other functions of 2 ; 
or we can replace the constant | by other quantities. For example, 


1 

27T i 


L 


-3/2+zo 


3/2 — zoo 


r (z)x z dz = e x - 1 + X, 


and this is the critical quantity in our formula ( 41 ) for T„ : 


Tr, = 


i E — f 

h 270 J 


— 3/2+ioo 


3/2— zoo 


T{z)(n/V)- 1 ~ z dz. 


(43) 


(44) 


The sum may be placed inside the integrals, since its convergence is absolutely 
well-behaved; we have 


E ( n / ^) W n w ' s ^2 l (l/2 w y — n w /(2 w — 1), when 5ft (w) > 0, 

j>i j > 1 

because | 2 "’| = 2 ^) > 1 . Therefore 


T — 

- L n — 


— / 

2m J_ 


— 3/2+zoo 


FWi 


3/2— zoo 


-1-z 


- 1 


dz, 


(45) 


and it remains to evaluate the latter integral. 

This time we integrate along a path that extends far to the right, as in 
Fig. 20(b). The top line integral is 0(n 1 / 2 e _7rJV / 2 f™ /2 \M + iN\ l dt), if 2 iN ^ 1. 
and the bottom line integral is equally negligible, when N and N' are much 
larger than M. The right-hand line integral is \T(M + it)\ dt). 

Fixing M and letting N, N' 00 shows that -T n /n is 0(n“ 1 - M ) plus the sum 
of the residues in the region —3/2 < 5ft(z) < M. The factor I' (c ) has simple poles 
at 2 = — 1 and z = 0, while n~ 1 ~ z has no poles, and l/(2~ 1 ~ z - 1) has simple 
poles when z = -1 + 27rifc/ln2. 




5.2.2 


SORTING BY EXCHANGING 133 


The double pole at z = —1 is the hardest to handle. We can use the well- 
known relation 

T (z + 1) = exp(— 7 Z + C(2)z 3 /2 - C(3)* 3 /3 + C(4)* 4 /4 -**•)> 

where £(s) = l _s + 2~ s + 3“ s + • • • = to deduce the following expansions 
when w = z + 1 is small: 

= T ( w + ^ = -w~ l + (7 - 1) + 0{w), 
w(w — 1 ) 

n~ l ~ z = 1 — w Inn + 0(w 2 ), 


1/(2 1 2 - 1) = —w x /ln2 - \ + O(w). 


The residue at 2 = -1 is the coefficient of in -1 in the product of these three 
formulas, namely | - (Inn + 7 - l)/ln2. Adding the other residues gives the 
formula 


T, 


In n + 7 


„± +Hn) + l + 0 (n- M ), 

n In 2 2 n 

for arbitrarily large M, where 5(n) is a rather strange function, 


(46) 


j (») = 575 E 5R(T( — 1 — 2nik/\n2) exp(2niklgn)) . (47) 

n/ fc> 1 


Notice that S(n) = S(2n). The average value of S(n) is zero, since the average 
value of each term is zero. (We may assume that (lgn) modi is uniformly 
distributed, in view of the results about floating point numbers in Section 4.2.4.) 
Furthermore, since |T(-1 + it ) | = |7r/(t(l + t 2 ) sinh7rt)| 1/2 , it is not difficult to 
show that 

1 8{n) | < 0.000000173; (48) 

thus we may safely ignore S(n) for practical purposes. For theoretical purposes, 
however, we can’t obtain a valid asymptotic expansion of U n without it; that is 
why U n is a comparatively difficult function to analyze. 

From the definition of T n in (41) we can see immediately that 


Tgn _ Tn 
2 n n 


1 e~ n 

1 • 

n n 


(49) 


Therefore the error term 0(n~ M ) in (46) is essential; it cannot be replaced by 
zero. However, exercise 54 presents another approach to the analysis, which 
avoids such error terms by deriving a rather peculiar convergent series. 

In summary, we have deduced the behavior of the difficult sum (38): 

I/ n =nlgn + n ^ +d(n)^ +0(1). (50) 


The gamma-function method we have used to obtain this result is a special case 
of the general technique of Mellin transforms , which are extremely useful in the 
study of radix-oriented recurrence relations. Other examples of this approach 


134 SORTING 


5.2.2 


can be found in exercises 51-53 and in Section 6.3. An excellent introduction 
to Mellin transforms and their applications to algorithmic analysis has been 
presented by P. Flajolet, X. Gourdon, and P. Dumas in Theoretical Computer 
Science 144 (1995), 3-58. 

EXERCISES 

1. [M‘20] Let a \ ... a n be a permutation of {1, . . . , n), and let i and j be indices such 
that i < j and ax > aj . Let a[ . . . a' n be the permutation obtained from ax . . . a n by 
interchanging a, and aj. Can a ] . . . a' n have more inversions than a\. . . a n ? 

► 2. [M25] (a) What is the minimum number of exchanges that will sort the permuta- 
tion 37698145 2? (b) In general, given any permutation n = ax ... a n of {1 , . . . , n}, 
let xch(7r) be the minimum number of exchanges that will sort 7r into increasing order. 
Express xch(7r) in terms of “simpler” characteristics of 7r. (See exercise 5.1.4-41 for 
another way to measure the disorder of a permutation.) 

3. [10] Is the bubble sort Algorithm B a stable sorting algorithm? 

4. [M23] If t = 1 in step B4, we could actually terminate Algorithm B immediately, 
because the subsequent step B2 will do nothing useful. What is the probability that 
t = 1 will occur in step B4 when sorting a random permutation? 

5. [M25] Let bi 62 • • • b„ be the inversion table for the permutation ax 02 ■ . . a n . Show 
that the value of BOUND after r passes of the bubble sort is max {bi + i | b t > r} — r, for 
0 < r < max (61 , ... , b n ). 

6. [M22] Let ax...a n be a permutation of {l,...,n} and let a[ . . . a' n be its in- 
verse. Show that the number of passes to bubble-sort ax-..a n is l + max(o' 1 — 1, 
a 2 - 2 ,...,a' n - n). 

7. [ M28 ] Calculate the standard deviation of the number of passes for the bubble 
sort, and express it in terms of n and the function P(n). [See Eqs. (6) and (7).] 

8. [M24] Derive Eq. (8). 

9. [M43] Analyze the number of passes and the number of comparisons in the cock- 
tail-shaker sorting algorithm. Note: See exercise 5. 4. 8-9 for partial information. 

10. [M26] Let ax a 2 ■ ■ ■ a n be a 2-ordered permutation of {1,2,..., 11 } . 

a) What are the coordinates of the endpoints of the dith step of the corresponding 
lattice path? (See Fig. 11 on page 87.) 

b) Prove that the comparison/exchange of a 1 : 0,2 , <13 : <7,4 , ... corresponds to folding 
the path about the diagonal, as in Fig. 18(b). 

c) Prove that the comparison/exchange of a 2 : a 2 +d, 0.4 : ax+d- ... corresponds to 
folding the path about a line m units below the diagonal, as in Figs. 18(c), (d), 
and (e), when d = 2 m — 1. 

► 11. [M25] What permutation of {1,2, ...,16} maximizes the number of exchanges 
done by Batcher’s algorithm? 

12. [24] Write a MIX program for Algorithm M, assuming that MIX is a binary com- 
puter with the operations AND, SRB. How much time does your program take to sort 
the sixteen records in Table 1? 

13. [10] Is Batcher’s method a stable sorting algorithm? 


5.2.2 


SORTING BY EXCHANGING 135 


14. [M21] Let c(N) be the number of key comparisons used to sort N elements by 
Batcher’s method; this is the number of times step M4 is performed. 

a) Show that c(2*) = 2c(2 f_1 ) + (t - l)2 f_1 + 1, for t > 1. 

b) Find a simple expression for c(2*) as a function of t. Hint: Consider the sequence 
x t =c( 2 t )/2 t . 

15. [ M38 ] The object of this exercise is to analyze the function c(N) of exercise 14, 

and to find a formula for c(N) when N = 2 ei + 2 62 H f 2 Et ', ei > e 2 > • • • > e r > 0. 

a) Let a(N) = c(iV+l) -c(N). Prove that a(2n) = a(n)+ [lg(2n)J, and a(2n + l) = 
a(n) + 1; hence 

a (N) = ^ j j - r(ei — 1) + (ei + e-i + • • • + e r ). 

b) Let x (n) = a(n) - a( \n/2\ ), so that a(n) = x(n) + x( [n/2j ) + x( |n/4J ) + • • • . Let 

y( n ) = ®(l) + *(2)d fa :(n); and let z(2n) = y(2n)-a(n), z(2n + l) = y(2n + l). 

Prove that c(N + 1) = z(N) + 2z( |_1V/2J ) + 4 z( [2V/4J ) H . 

c) Prove that y(N) = N + (\_N/2\ + l)(cj - 1) - 2 ei + 2. 

d) Now put everything together and find a formula for c(N) in terms of the exponents 
ej, holding r fixed. 

16. Find the asymptotic value of the average number of exchanges occurring 
when Batcher’s method is applied to a random permutation of N distinct elements, 
assuming that N is a power of two. 

► 17. [20] Where in Algorithm Q do we use the fact that K 0 and K N+1 have the values 
postulated in (13)? 

► 18. [20] Explain how the computation proceeds in Algorithm Q when all of the input 
keys are equal. What would happen if the u <” signs in steps Q3 and Q4 were changed 
to “<” instead? 

19. [15] Would Algorithm Q still work properly if a queue (first-in-first-out) were 
used instead of a stack (last-in-first-out)? 

20. [M20] What is the largest possible number of elements that will ever be on the 
stack at once in Algorithm Q, as a function of M and N? 

21 . [20] Explain why the first partitioning phase of Algorithm Q takes the number of 
comparisons and exchanges specified in (17), when the keys are distinct. 

22. [M25] Let PkN be the probability that the quantity A in (16) will equal k. when 
Algorithm Q is applied to a random permutation of {1,2,..., A?}, and let Ajv(z) = 
J2k PkNZ k be the corresponding generating function. Prove that A N (z) = 1 for N < M, 
and An(z) = z(5Zi< s< jv A s ~i(z)An-s(z))/N for N > M. Find similar recurrence 
relations defining the other probability distributions B N (z), C N (z ), D N (z), E N (z), 
Sn(z). 

23. [MSS] Let Ajv, Bn, Dn, En, Sn be the average values of the corresponding 
quantities in (16), when sorting a random permutation of (1,2,..., TV } . Find recur- 
rence relations for these quantities, analogous to (18); and solve these recurrences to 
obtain (25). 

24. [ M21 ] Algorithm Q obviously does a few more comparisons than it needs to, since 
we can have i = j in step Q3 and even i > j in step Q4. How many comparisons Cn 
would be done on the average if we avoided all comparisons when i > j? 


136 SORTING 


5.2.2 


25. [M20] When the input keys are the numbers 1 2 ... IV in order, what are the exact 
values of the quantities A, B, C, D, E, and S in the timing of Program Q? (Assume 
that N > M.) 

► 26. [M24] Construct an input file that makes Program Q go even more slowly than 
it does in exercise 25. (Try to find a really bad case.) 

27. [M28] (R. Sedgewick.) Consider the best case of Algorithm Q : Find a permutation 
of {1,2,..., 23} that takes the least time to be sorted when N = 23 and M = 3. 

28. [ M26 ] Find the recurrence relation analogous to ( 20 ) that is satisfied by the 
average number of comparisons in Singleton’s modification of Algorithm Q (choosing 
s as the median of {A'i , Aq(jv+i)/ 2 j > A;v} instead of s — K\). Ignore the comparisons 
made when computing the median value s. 

29. [HM40] Continuing exercise 28, find the asymptotic value of the number of com- 
parisons in Singleton’s “median of three” method. 

► 30. [ 25] (P. Shackleton.) When multiword keys are being sorted, many sorting meth- 
ods become progressively slower as the file gets closer to its final order, since equal 
and nearly-equal keys require an inspection of several words to determine the proper 
lexicographic order. (See exercise 5-5.) Files that arise in practice often involve such 
keys, so this phenomenon can have a significant impact on the sorting time. 

Explain how Algorithm Q can be extended to avoid this difficulty; within a subfile 
in which the leading k words are known to have constant values for all keys, only the 
( k + l)st words of the keys should be inspected. 

► 31. [20] (C. A. R. Hoare.) Suppose that, instead of sorting an entire file, we only 
want to determine the mt.h smallest of a given set of n elements. Show that quicksort 
can be adapted to this purpose, avoiding many of the computations required to do a 
complete sort. 

32. [M 40 ] Find a simple closed form expression for C„ m , the average number of key 
comparisons required to select the mth smallest of n elements by the “quickfind” 
method of exercise 31. (For simplicity, let M = 1; that is, don’t assume the use of 
a special technique for short subfiles.) What is the asymptotic behavior of C( 2m -i)m, 
the average number of comparisons needed to find the median of 2m - 1 elements by 
Hoare’s method? 

► 33. [15] Design an algorithm that rearranges all the numbers in a given table so 
that all negative values precede all nonnegative ones. (The items need not be sorted 
completely, just separated between negative and nonnegative.) Your algorithm should 
use the minimum possible number of exchanges. 

34. [20] How can the bit-inspection loops of radix exchange (in steps R3 through R6) 
be speeded up? 

35. [M23] Analyze the values of the frequencies A, B. C, G, A, L, R. S, and X that 
arise in radix exchange sorting using “case (i) input.” 

36. [M2 7] Given a sequence of numbers (a n ) = a 0 , a lt o 2 , . . . , define its binomial 
transform {a„) = ao,di,a 2 , ... by the rule 

a) Prove that ( a„ ) = (a„). 

b) Find the binomial transforms of the sequences (1); (n); {(")), for fixed m; ( a n ), 
for fixed a; ((^)a n ), for fixed a and m. 


5.2.2 


SORTING BY EXCHANGING 137 


c) Suppose that a sequence ( x n ) satisfies the relation 


Xn — Ojn ~t“ 2 


'■•EG) 


Xk , 


for n > 2; 


xo = = no = ai 


0. 


Prove that the solution to this recurrence is 

— £ (I) — +£ (;)(-.)* 

k>2 k > 2 

37 . [AF28] Determine all sequences (a n ) such that (a„) = (a n ), in the sense of exer- 
cise 36. 

► 38 . [M30] Find An, Bn, Cn, Gn, Kn, Ln, Rn, and Xn, the average values of the 
quantities in (29), when radix exchange is applied to “case (ii) input.” Express your 
answers in terms of N and the quantities 


U n 


£(;) 


(-ii* 
2 k ~ 1 - 1 


V n 


£(I) 


(-p*fc 

2 k ~ 1 - 1 


— Tl(U n Un-!). 


[Hint: See exercise 36.] 

39. [20] The results shown in (30) indicate that radix exchange sorting involves about 
1.441V partitioning stages when it is applied to random input. Prove that quicksort 
will never require more than N stages; and explain why radix exchange often does. 

40. [21] Explain how to modify Algorithm R so that it works with reasonable effi- 
ciency when sorting files containing numerous equal keys. 

► 41. [30] Devise a good way to exchange records Ri ... R r so that they are partitioned 
into three blocks, with (i) Kk < K for l < k < i; (ii) Kk = K for i < k < j; (iii) 
Kk > K for j < k < r. Schematically, the final arrangement should be 


< K 


= K 


> K 


42. [HM32] For any real number c > 0, prove that the probability is less than e~ c 
that Algorithm Q will make more than (c + 1)(1V + 1 )Hn comparisons when sorting 
random data. (This upper bound is especially interesting when c is, say, N e .) 

43. [HM21] Prove that f* y~ 1 (e~ y — 1 ) dy + y~ 1 e~ y dy = —7. [Hint: Consider 
lim a _ > o+ y a_1 -] 

44. [HM24] Derive (37) as suggested in the text. 

45. [HM20] Explain why (43) is true, when x > 0. 

46. [HM20] What is the value of (l/27ri) T(z)n s ~ z dz/(2 3 ~ z — 1), given that 

s is a positive integer and 0 < a < s? 

47. [HM21] Prove that '}Z J>1 (ji/2 i ) e~ n ^ 2J is a bounded function of n. 

48. [HM24] Find the asymptotic value of the quantity V n defined in exercise 38, using 
a method analogous to the text’s study of U n , obtaining terms up to 0(1). 

49. [HM24 ] Extend the asymptotic formula (47) for U n to 0(n -1 ). 

50. [HM24] Find the asymptotic value of the function 


138 SORTING 


5.2.2 


when m. is any fixed number greater than 1 . (When m is an integer greater than 2 , 
this quantity arises in the study of generalizations of radix exchange, as well as the 
trie memory search algorithms of Section 6.3.) 

► 51. [HM28] Show that the gamma-function approach to asymptotic problems can be 
used instead of Euler’s summation formula to derive the asymptotic expansion of the 
quantity r k (m) in ( 35 ). (This gives us a uniform method for studying r k (m ) for all k. 
without relying on tricks such as the text’s introduction of g-i(x) = ( e -* 2 - \)/x.) 

52. [HM35] (N. G. de Bruijn.) What is the asymptotic behavior of the sum 

where d.(t.) is the number of divisors of £? (Thus, d(l) = 1 , d{ 2) = d( 3 ) = 2 , d( 4 ) = 3 , 
d(5) = 2, etc. This question arises in connection with the analysis of a tree traversal 
algorithm, exercise 2.3.1-11.) Find the value of S„/( 2 n n ) to terms of 0(n“ 1 ). 

53. [HM42] Analyze the average number of bit inspections and exchanges done by 
radix exchange when the input data consists of infinite-precision binary numbers in 
[O' • X )’ ea c h of whose bits is independently equal to 1 with probability p. (Only the 
case p = - is discussed in the text; the methods we have used can be generalized to 
arbitrary p.) Consider in particular the case p = 1 /</> = .61803 .... 

54. [HM24] (S. O. Rice.) Show that U n can be written 


U n = (-1)’ 


n\ 

2tti 


dz 


1 


)c z (z - 1 ) ... (z - n ) 2 2_1 - 1 ’ 

where C is a skinny closed curve encircling the points 2, 3 , . . . , n. Changing C to an 
arbitrarily large circle centered at the origin, derive the convergent series 

2 


Un = ^ n ~ 1 1 )" _ 2 + 2 + 

In 2 2 In 2 


X(B(n + 1 , -1 + ibm)), 


m > 1 


where b - 2tt/ I n 2 , and B(n+ 1, -1+ibm) = T(n + 1)T(-1 + ibm)/T(n + ibm) = 
«>/n*=o(*- 1 + ibm). 

► 55. [22] Show how to modify Program Q so that the partitioning element is the 
median of the three keys ( 28 ), assuming that M > 1 . 

56. [M43] Analyze the average behavior of the quantities that occur in the running 
time of Algorithm Q when the program has been modified to take the median of three 
elements as in exercise 55. (See exercise 29.) 


5.2.3. Sorting by Selection 

Another important family of sorting techniques is based on the idea of repeated 
selection. The simplest selection method is perhaps the following: 

i) Find the smallest key; transfer the corresponding record to the output area; 
then replace the key by the value 00 (which is assumed to be higher than 
any actual key). 

ii) Repeat step (i). This time the second smallest key will be selected, since 
the smallest key has been replaced by 00 . 

iii) Continue repeating step (i) until N records have been selected. 


5.2.3 


SORTING BY SELECTION 139 


A selection method requires all of the input items to be present before sorting 
may proceed, and it generates the final outputs one by one in sequence. This is 
essentially the opposite of insertion, where the inputs are received sequentially 
but we do not know any of the final outputs until sorting is completed. 

Step (i) involves N—l comparisons each time a new record is selected, and it 
also requires a separate output area in memory. But we can obviously do better: 
We can move the selected record into its proper final position, by exchanging it 
with the record currently occupying that position. Then we need not consider 
that position again in future selections, and we need not deal with infinite keys. 
This idea yields our first selection sorting algorithm. 

Algorithm S ( Straight selection sort). Records i?i, . . . ,f?jv are rearranged in 
place; after sorting is complete, their keys will be in order, K x < ■ ■ ■ < K N . 
Sorting is based on the method indicated above, except that it proves to be 
more convenient to select the largest element first, then the second largest, etc. 

51. [Loop on j.\ Perform steps S2 and S3 for j = N, N - 1, ... , 2. 

52. [Find max(A'i, . . . , Kj).] Search through keys Kj,Kj-i, ... ,K X to find a 
maximal one; let it be A"), where i is as large as possible. 

53. [Exchange with Rj.} Interchange records R t o R } . (Now records Rj,..., R N 
are in their final position.) | 



Fig. 21. Straight selection sorting. 


Table 1 shows this algorithm in action on our sixteen example keys. Elements 
that are candidates for the maximum during the right-to-left search in step S2 
are shown in boldface type. 

Table 1 

STRAIGHT SELECTION SORTING 


503 087 512 061 908 170 897 275 653 426 154 509 612 677 765 703 1 
503 087 512 061 703 170 897 275 653 426 154 509 612 677 765 | 908 
503 087 512 061 703 170 765 275 653 426 154 509 612 677 | 897 908 
503 087 512 061 703 170 677 275 653 426 154 509 612 | 765 897 908 
503 087 512 061 612 170 677 275 653 426 154 509 | 703 765 897 908 
503 087 512 061 612 170 509 275 653 426 154 | 677 703 765 897 908 

061 | 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908 




140 SORTING 


5.2.3 


The corresponding MIX program is quite simple: 

Program S ( Straight selection sort). As in previous programs of this chapter, 
the records in locations INPUT+1 through INPUT+N are sorted in place, on a full- 
word key. rA = current maximum, rll = j - 1, rI2 = k (the current search 


position), rI3 = i 

. Assume that N >2. 

01 

START 

ENT1 

N-l 

1 

SI. Lood on i. 7 «— N. 

02 

2H 

ENT2 

0,1 

N — 1 

S2. Find max( K\ K , ) . 

03 


ENT3 

1,1 

N-l 

i <- j. 

04 


LDA 

INPUT , 3 

N - 1 

rA <- Ki. 

05 

8H 

CMPA 

INPUT, 2 

A 


06 


JGE 

*+3 

A 

Jump if Ki > Kk- 

07 


ENT3 

0,2 

B 

Otherwise set i <— k, 

08 


LDA 

INPUT, 3 

B 

rA +- Ki. 

09 


DEC2 

1 

A 

k 4 — k — 1. 

10 


J2P 

8B 

A 

Repeat if k > 0. 

11 


LDX 

INPUT+1 , 1 

N-l 

S3. Exchanse with R , . 

12 


STX 

INPUT, 3 

N-l 

R : i — Rj . 

13 


STA 

INPUT+1 , 1 

N- 1 

Rj +— rA. 

14 


DEC1 

1 

N - 1 


15 


J1P 

2B 

N - 1 

N>j> 2. | 


The running time of this program depends on the number of items, N; the 
number of comparisons, A; and the number of changes to right-to-left maxima, B. 
It is easy to see that 

-MaH"'"- 1 )' W 

regardless of the values of the input keys; hence only B is variable. In spite of the 
simplicity of straight selection, this quantity B is not easy to analyze precisely. 
Exercises 3 through 6 show that 

B = (min 0, ave (N + 1)H N - 2N , max [N 2 /A\)\ ( 2 ) 

in this case the maximum value turns out to be particularly interesting. The 
standard deviation of B is of order TV 3 / 4 ; see exercise 7. 

Thus the average running time of Program S is 2.5N 2 + 3 (N + 1 )H n + 
3-5 N - 11 units, just slightly slower than straight insertion (Program 5. 2. IS). 
It is interesting to compare Algorithm S to the bubble sort (Algorithm 5.2.2B), 
since bubble sorting may be regarded as a selection algorithm that sometimes 
selects more than one element at a time. For this reason bubble sorting usually 
does fewer comparisons than straight selection and it may seem to be preferable; 
but in fact Program 5.2.2B is more than twice as slow as Program S! Bubble 
sorting is handicapped by the fact that it does so many exchanges, while straight 
selection involves very little data movement. 

Refinements of straight selection. Is there any way to improve on the 
selection method used in Algorithm S? For example, take the search for a 
maximum in step S2; is there a substantially faster way to find a maximum? 
The answer to the latter question is no! 


5.2.3 


SORTING BY SELECTION 141 


Lemma M. Every algorithm for finding the maximum of n elements, based on 
comparing pairs of elements, must make at least n — 1 comparisons. 

Proof. If we have made fewer than n — 1 comparisons, there will be at least two 
elements that have never been found to be less than any others. Therefore we do 
not know which of these two elements is larger, and we cannot have determined 
the maximum. | 

Thus, any selection process that finds the largest element must perform at 
least n — 1 comparisons; and we might suspect that all sorting methods based on 
n repeated selections are doomed to require fl(n 2 ) operations. But fortunately 
Lemma M applies only to the first selection step; subsequent selections can make 
use of previously gained information. For example, exercises 8 and 9 show that 
a comparatively simple change to Algorithm S will cut the average number of 
comparisons in half. 

Consider the 16 numbers in Table 1; one way to save time on repeated 
selections is to regard them as four groups of four. We can start by determining 
the largest in each group, namely the respective keys 

512, 908, 653, 765; 

the largest of these four elements, 908, is then the largest of the entire file. To 
get the second largest we need only look at 512, 653, 765, and the other three 
elements of the group containing 908; the largest of {170, 897, 275} is 897, and 
the largest of 

512, 897, 653, 765 

is 897. Similarly, to get the third largest element we determine the largest of 
{170, 275} and then the largest of 

ji 

512, 275, 653, 765. 

Each selection after the first takes at most 5 additional comparisons. In general, 
if N is a perfect square, we can divide the file into \/N groups of \ /N elements 
each; each selection after the first takes at most y/N — 2 comparisons within 
the group of the previously selected item, plus \/N — 1 comparisons among the 
“group leaders.” This idea is called quadratic selection; its total execution time 
is 0(Ny r N), which is substantially better than order N 2 . 

Quadratic selection was first published by E. H. Friend [JACM 3 (1956), 

152 154], who pointed out that the same idea can be generalized to cubic, 
quartic, and higher degrees of selection. For example, cubic selection divides the 
file into i/N large groups, each containing x^N small groups, each containing '/N 
records; the execution time is proportional to N \fN . If we carry this idea to its 
ultimate conclusion we arrive at what Friend called “nth degree selecting,” based 
on a binary tree structure. This method has an execution time proportional to 
N log N ; we shall call it tree selection. 

Tree selection. The principles of tree selection sorting are easy to understand 
in terms of matches in a typical “knockout tournament.” Consider, for example, 


142 SORTING 


5.2.3 


the results of the ping-pong contest shown in Fig. 22; at the bottom level, Kim 
beats Sandy and Chris beats Lou, then in the next round Chris beats Kim, etc. 


Chris 



Chris 



Pat 


Kim 

Chris 

Pat 

Robin 

Kim 

Sandy 

Chris 

Lou 

Pat 

Ray 

Dale 

Robin 


Fig. 22. A ping-pong tournament. 


Figure 22 shows that Chris is the champion of the eight players, and 8-1 = 7 
matches/comparisons were required to determine this fact. Pat is not necessarily 
the second-best player; any of the people defeated by Chris, including the first- 
round loser Lou, might possibly be second best. We can determine the second- 
best player by having Lou play Kim, and the winner of that match plays Pat: 
only two additional matches are required to find the second-best player, because 
of the structure we have remembered from the earlier games. 

In general, we can “output” the player at the root of the tree, and replay 
the tournament as if that player had been sick and unable to play a good game. 
Then the original second-best player will rise to the root; and to recalculate the 
winners in the upper levels of the tree, only one path must be changed. It follows 
that fewer than [lglV] further comparisons are needed to select the second-best 
player. The same procedure will find the third-best, etc.; hence the total time for 
such a selection sort will be roughly proportional to N log N, as claimed above. 

Figure 23 shows tree selection sorting in action, on our 16 example numbers. 
Notice that we need to know where the key at the root came from, in order to 
know where to insert the next “-oo”. Therefore each branch node of the tree 
should actually contain a pointer or index specifying the position of the relevant 
key, instead of the key itself. It follows that we need memory space for N input 
records, N — 1 pointers, and N output records or pointers to those records. 
(If the output goes to tape or disk, of course, we don’t need to retain the output 
records in high-speed memory.) 

The reader should pause at this point and work exercise 10, because a good 
understanding of the basic principles of tree selection will make it easier to 
appreciate the remarkable improvements we are about to discuss. 

One way to modify tree selection, essentially introduced by K. E. Iverson 
[A Programming Language (Wiley, 1962), 223-227], does away with the need for 
pointers by “looking ahead” in the following way: When the winner of a match 
in the bottom level of the tree is moved up, the winning value can be replaced 
immediately by — oo at the bottom level; and whenever a winner moves up from 
one branch to another, we can replace the corresponding value by the one that 
should eventually move up into the vacated place (namely the larger of the two 
keys below). Repeating this operation as often as possible converts Fig. 23(a) 
into Fig. 24. 


5-2.3 SORTING BY SELECTION 143 



512 908 653 765 



503 512 908 897 653 509 677 765 


/\ /\ /\ /\ /\ /\ /\ /\ 

503 087 512 061 908 170 897 275 653 426 154 509 612 677 765 703 

(a) Initial configuration. 



512 897 653 765 



503 512 170 897 653 509 677 765 


/\ /\ /\ /\ /\ /\ /\ /\ 

503 087 512 061 -oo 170 897 275 653 426 154 509 612 677 765 703 

(b) Key 908 is replaced by — oo, and the second highest element moves up to the root. 



503 512 170 275 426 509 -oo -oo 

/\ /\ /\ /\ /\ /\ /\ /\ 


503 087 512 061 -oo 170 -oo 275 -oo 426 154 509 -oo -oo -oo -oo 

(c) Configuration after 908, 897, 765, 703, 677, 653, and 612 have been output. 

Fig. 23. An example of tree selection sorting. 



512 275 653 703 



503 061 170 -oo 426 509 677 -oo 


/\ /\ /\ /\ /\ /\ /\ /\ 

— oo 087 — oo -oo — oo -oo — oo — oo — oo — oo 154 — oo 612 — oo — oo -oo 


Fig. 24. The Peter Principle applied to sorting. Everyone rises to their level of 
incompetence in the hierarchy. 


144 SORTING 


5.2.3 


Once the tree has been set up in this way we can proceed to sort by a “top- 
down” method, instead of the “bottom up” method of Fig. 23: We output the 
root, then move up its largest descendant, then move up the latter’s largest 
descendant, and so forth. The process begins to look less like a ping-pong 
tournament and more like a corporate system of promotions. 

The reader should be able to see that this top-down method has the ad- 
vantage that redundant comparisons of -oo with — oo can be avoided. (The 
bottom-up approach finds — oo omnipresent in the latter stages of sorting, but 
the top-down approach can stop modifying the tree during each stage as soon 
as a — oo has been stored.) 

Figures 23 and 24 are complete binary trees with 16 terminal nodes (see 
Section 2. 3. 4. 5), and it is convenient to represent such trees in consecutive 
locations as shown in Fig. 25. Note that the parent of node number k is node 
\k/2 \ , and its children are nodes 2k and 2k + 1. This leads to another advantage 
of the top-down approach, since it is often considerably simpler to go top-down 
from node k to nodes 2k and 2k + 1 than bottom-up from node k to nodes k ® ] 
and [k/2\. (Here k ® 1 stands for k + 1 or k - 1, according as k is even or odd.) 



Fig. 25. Sequential storage allocation for a complete binary tree. 


Our examples of tree selection so far have more or less assumed that N is 
a power of 2; but actually we can work with arbitrary N, since the complete 
binary tree with N terminal nodes is readily constructed for any N. 

Now we come to the crucial question: Can’t we do the top-down method 
without using — oo at all? Wouldn’t it be nice if the important information of 
Fig. 24 were all in locations 1 through 16 of the complete binary tree, without the 
useless “holes” containing — oo? Some reflection shows that it is indeed possible 
to achieve this goal, not only eliminating — oo but also avoiding the need for an 
auxiliary output area. This line of thinking leads us to an important sorting 
algorithm that was christened “heapsort” by its discoverer J. W. J. Williams 
[CACM 7 (1964), 347-348], 

Heapsort. Let us say that a file of keys Ki, K 2 , . . . , K N is a heap if 
K U/ 2J > K j for 1 < LJ/2J <j<N. 


( 3 ) 




5.2.3 


SORTING BY SELECTION 145 


Thus, K\ > K' 2 , Ki > K :i . K 2 > K 4 , etc.; this is exactly the condition that 
holds in Fig. 24, and it implies in particular that the largest key appears “on top 
of the heap,” 

K 1 =max(K 1 ,K 2 ,...,K N ). ( 4 ) 

If we can somehow transform an arbitrary input file into a heap, we can sort the 
elements by using a top-down selection procedure as described above. 

An efficient approach to heap creation has been suggested by R. W. Floyd 
[CACM 7 (1964), 701]. Let us assume that we have been able to arrange the file 
so that 

K um ^ K i for 1 < Li/ 2 J < j < N, ( 5 ) 

where l is some number > 1. (In the original file this condition holds vacuously for 
l = |A72j, since no subscript j satisfies the condition [A7/2J < [_j/2j < j < N.) 
It is not difficult to see how to transform the file so that the inequalities in ( 5 ) 
are extended to the case l ~ |_ j / 2 j , working entirely in the subtree whose root 
is node l. Then we can decrease l by 1, until condition ( 3 ) is finally achieved. 
These ideas of Williams and Floyd lead to the following elegant algorithm, which 
merits careful study: 

Algorithm H ( Heapsort ). Records Ai,...,Ajv are rearranged in place; after 
sorting is complete, their keys will be in order, K L < ■ ■ ■ < I( N . First we 
rearrange the file so that it forms a heap, then we repeatedly remove the top of 
the heap and transfer it to its proper final position. Assume that N > 2. 

HI. [Initialize.] Set l «— \_N/2\ + 1, r 4— N. 

H2. [Decrease l or r.] If l > 1, set / 4 - l - 1, R i?/, K <- K t . (If l > 1 , we are 
in the process of transforming the input file into a heap; on the other hand 
if l = 1, the keys Ki K- 2 . . . K r presently constitute a heap.) Otherwise set 
R <— R r , K <— K r , R r 4 — Ri, and r <— r — 1; if this makes r = 1, set 
Ri 4— R and terminate the algorithm. 

H3. [Prepare for siftup.] Set j f- l. (At this point we have 

Ky k/2 j > K k for l < [k/2\ < k < r; ( 6 ) 

and record R k is in its final position for r < k < N. Steps H3-H8 are called 
the siftup algorithm ; their effect is equivalent to setting Ri 4 — R and then 
rearranging A;, . . . , R r so that condition ( 6 ) holds also for l — [k/2\.) 

H4. [Advance downward.] Set i 4 — j and j 4- 2 j. (In the following steps we 
have i ~ Li/2J.) If j < r, go right on to step H5; if j = r, go to step H 6 ; 
and if j > r , go to H 8 . 

H5. [Find larger child.] If Kj < A J+ i, then set j i — j T 1. 

H6. [Larger than AT?] If K > Kj , then go to step H 8 . 

H7. [Move it up.] Set Ri 4 — Rj, and go back to step H4. 

H8. [Store A.] Set Ri 4— A. (This terminates the siftup algorithm initiated in 
step H3.) Return to step H2. | 


146 SORTING 


5.2.3 



Fig. 26. Heapsort; dotted lines enclose the siftup algorithm. 

Heapsort has sometimes been described as the @ algorithm, because of the 
motion of l and r. The upper triangle represents the heap-creation phase, when 
r — N and l decreases to 1; and the lower triangle represents the selection phase, 
when l = 1 and r decreases to 1. Table 2 shows the process of heapsorting our 
sixteen example numbers. (Each line in that table shows the state of affairs at 
the beginning of step H2, and brackets indicate the position of l and r.) 

Program H (Heapsort). The records in locations INPUT+1 through INPUT+N 
are sorted by Algorithm H, with the following register assignments: rll = l — 1, 
rI2 = r — 1, rI3 = i, rI4 = j, rI5 = r - j, rA = K = R,rX = Rj. 


01 

START 

ENT1 

N/2 

1 

HI. Initialize. 1 +- | JV/2 ( + 1. 

02 


ENT2 

N-l 

1 

r 4— N. 

03 

1H 

DEC1 

1 

L7V/2J 

l<-l- 1. 

04 


LDA 

INPUT+1 , 1 

L7V/2J 

R+- Ri,K <- K,. 

05 

3H 

ENT4 

1,1 

P 

H3. Prepare for siftup. i «— l. 

06 


ENT5 

0,2 

P 


07 


DEC5 

0,1 

P 

rI5 r — j . 

08 


JMP 

4F 

P 

To H4. 

09 

5H 

LDX 

INPUT, 4 

B + A-D 

H5. Find larger child. 

10 


CMPX 

INPUT+1, 4 

B + A- D 


11 


JGE 

6F 

B + A-D 

Jump if Kj > Kj + 1 . 

12 


INC4 

1 

C 

Otherwise set j <— j + 1. 

13 


DEC5 

1 

C 


U 

9H 

LDX 

INPUT, 4 

C + D 

rX +- Rj. 

15 

6H 

CMPA 

INPUT, 4 

B + A 

H6. Larger than K? 

16 


JGE 

8F 

B + A 

To H8 if K > Kj. 

17 

7H 

STX 

INPUT, 3 

B 

H7. Move it up. R, <— Rt. 

18 

4H 

ENT3 

0,4 

B + P 

H4. Advance downward, i +- i 

19 


DEC5 

0,4 

B + P 

rI5 <— rI5 — j. 

20 


INC4 

0,4 

B + P 

j j + j. 








5.2.3 


SORTING BY SELECTION 147 


Table 2 

EXAMPLE OF HEAPSORT 


Ah 

K2 

Ah 

Ah 

Ah 

Ah 

Ah 

Ah 

Ah 

A 10 

Ahi 

Ai 2 

a 13 

a 14 

Ais 

Ahe 

l 

r 

503 

087 

512 

061 

908 

170 

897 

275 

[653 

426 

154 

509 

612 

677 

765 

703] 

9 

16 

503 

087 

512 

061 

908 

170 

897 

[703 

653 

426 

154 

509 

612 

677 

765 

275] 

8 

16 

503 

087 

512 

061 

908 

170 

[897 

703 

653 

426 

154 

509 

612 

677 

765 

275] 

7 

16 

503 

087 

512 

061 

908 

[612 

897 

703 

653 

426 

154 

509 

170 

677 

765 

275] 

6 

16 

503 

087 

512 

061 

[908 

612 

897 

703 

653 

426 

154 

509 

170 

677 

765 

275] 

5 

16 

503 

087 

512 

[703 

908 

612 

897 

275 

653 

426 

154 

509 

170 

677 

765 

061] 

4 

16 

503 

087 

[897 

703 

908 

612 

765 

275 

653 

426 

154 

509 

170 

677 

512 

061] 

3 

16 

503 

[908 

897 

703 

426 

612 

765 

275 

653 

087 

154 

509 

170 

677 

512 

061] 

2 

16 

[908 

703 

897 

653 

426 

612 

765 

275 

503 

087 

154 

509 

170 

677 

512 

061] 

1 

16 

[897 

703 

765 

653 

426 

612 

677 

275 

503 

087 

154 

509 

170 

061 

512] 

908 

1 

15 

[765 

703 

677 

653 

426 

612 

512 

275 

503 

087 

154 

509 

170 

061] 

897 

908 

1 

14 

[703 

653 

677 

503 

426 

612 

512 

275 

061 

087 

154 

509 

170] 

765 

897 

908 

1 

13 

[677 

653 

612 

503 

426 

509 

512 

275 

061 

087 

154 

170] 

703 

765 

897 

908 

1 

12 

[653 

503 

612 

275 

426 

509 

512 

170 

061 

087 

154] 

677 

703 

765 

897 

908 

1 

11 

[612 

503 

512 

275 

426 

509 

154 

170 

061 

087] 

653 

677 

703 

765 

897 

908 

1 

10 

[512 

503 

509 

275 

426 

087 

154 

170 

061] 

612 

653 

677 

703 

765 

897 

908 

1 

9 

[509 

503 

154 

275 

426 

087 

061 

170] 

512 

612 

653 

677 

703 

765 

897 

908 

1 

8 

[503 

426 

154 

275 

170 

087 

061] 

509 

512 

612 

653 

677 

703 

765 

897 

908 

1 

7 

[426 

275 

154 

061 

170 

087] 

503 

509 

512 

612 

653 

677 

703 

765 

897 

908 

1 

6 

[275 

170 

154 

061 

087] 

426 

503 

509 

512 

612 

653 

677 

703 

765 

897 

908 

1 

5 

[170 

087 

154 

061] 

275 

426 

503 

509 

512 

612 

653 

677 

703 

765 

897 

908 

1 

4 

[154 

087 

061] 

170 

275 

426 

503 

509 

512 

612 

653 

677 

703 

765 

897 

908 

1 

3 

[087 

061] 

154 

170 

275 

426 

503 

509 

512 

612 

653 

677 

703 

765 

897 

908 

1 

2 


21 


J5P 

5B 

B + P 

To H5 if j < r. 

22 


J5Z 

9B 

P - A + D 

To H6 if j = r. 

23 

8H 

STA 

INPUT ,3 

P 

H8. Store R. Rj s— R. 

24 

2H 

J1P 

IB 

P 

H2. Decrease l or r. 

25 


LDA 

INPUT+ 1,2 

N - 1 

If l = 1, set R s- R r , K s— Ah 

26 


LDX 

INPUT+ 1 

N — 1 


27 


STX 

INPUTS- 1,2 

N — 1 

R r <- An 

28 


DEC2 

1 

N - 1 

r S— r — 1. 

29 


J2P 

3B 

N — 1 

To H3 if r > 1 . 

30 


STA 

INPUTs-1 

1 

Ah S— R. | 


Although this program is only about twice as long as Program S, it is much 
more efficient when N is large. Its running time depends on 

P = N + [N/2\ — 2, the number of siftup passes; 

A, the number of siftup passes in which the key K finally lands 
in an interior node of the heap; 

B, the total number of keys promoted during siftups; 

C, the number of times j t— j + 1 in step H5; and 

D, the number of times j = r in step H4. 


148 SORTING 


5.2.3 


These quantities are analyzed below; in practice they show comparatively little 
fluctuation about their average values, 

A ss 0.3491V, B « NlgN - 1.871V, 

C « ±AlgA - 0.941V, D&\gN. (7 ' 

For example, when A’ = 1000, four experiments on random input gave, respec- 
tively, A = 371, 351, 341, 340; B ■= 8055, 8072, 8094, 8108; C = 4056, 4087, 
4017, 4083; and D = 12, 14, 8, 13. The total running time, 

7 A + UB + 4 C + 20A -2 D + 15[iV/2j - 28, 

is therefore approximately 16 A” lg A + 0.01 A units on the average. 

A glance at Table 2 makes it hard to believe that heapsort is very efficient: 
large keys migrate to the left before we stash them at the right! It is indeed a 
strange way to sort, when N is small; the sorting time for the 16 keys in Table 2 
is 1068u, while the simple method of straight insertion (Program 5. 2. IS) takes 
only 514 m. Straight selection (Program S) takes 853w. 

For larger A, Program H is more efficient. It invites comparison with 
shellsort (Program 5.2. ID) and quicksort (Program 5.2.2Q), since all three pro- 
grams sort by comparisons of keys and use little or no auxiliary storage. When 
N = 1000, the approximate average running times on MIX are 

160000 m for heapsort, 

130000u for shellsort, 

80000 m for quicksort. 

(MIX is a typical computer, but particular machines will of course yield somewhat 
different relative values.) As A gets larger, heapsort will be superior to shell- 
sort, but its asymptotic running time 16AlgA ss 23.08ATnA will never beat 
quicksort’s 11.67 N In N. A modification of heapsort discussed in exercise 18 will 
speed up the process by substantially reducing the number of comparisons, but 
even this improvement falls short of quicksort. 

On the other hand, quicksort is efficient only on the average, and its worst 
case is of order N 2 . Heapsort has the interesting property that its worst case 
isn’t much worse than the average: We always have 

A < 1.5 AT, B < A|_lgA_|, C<N[\gN\, (8) 

so Program H will take no more than 18A|_lg N\ + 38N units of time, regardless 
of the distribution of the input data. Heapsort is the first sorting method we 
have seen that is guaranteed to be of order A log A. Merge sorting, discussed in 
Section 5.2.4 below, also has this property, but it requires more memory space. 

Largest in, first out. We have seen in Chapter 2 that linear lists can often be 
classified in a meaningful way by the nature of the insertion and deletion oper- 
ations that make them grow and shrink. A stack has last-in-first-out behavior, 
in the sense that every deletion removes the youngest item in the list — the item 
that was inserted most recently of all items currently present. A simple queue 


5.2.3 


SORTING BY SELECTION 149 


has first-in-first-out behavior, in the sense that every deletion removes the oldest 
remaining item. In more complex situations, such as the elevator simulation of 
Section 2.2.5, we want a smallest-in-first-out list, where every deletion removes 
the item having the smallest key. Such a list may be called a priority queue, 
since the key of each item reflects its relative ability to get out of the list quickly. 
Selection sorting is a special case of a priority queue in which we do N insertions 
followed by N deletions. 

Priority queues arise in a wide variety of applications. For example, some 
numerical iterative schemes are based on repeated selection of an item having 
the largest (or smallest) value of some test criterion; parameters of the selected 
item are changed, and it is reinserted into the list with a new test value, based on 
the new values of its parameters. Operating systems often make use of priority 
queues for the scheduling of jobs. Exercises 15, 29, and 36 mention other typical 
applications of priority queues, and many other examples will appear in later 
chapters. 

How shall we implement priority queues? One of the obvious methods is 
to maintain a sorted list, containing the items in order of their keys. Inserting 
a new item is then essentially the same problem we have treated in our study 
of insertion sorting, Section 5.2.1. Another even more obvious way to deal with 
priority queues is to keep the list of elements in arbitrary order, selecting the 
appropriate element each time a deletion is required by finding the largest (or 
smallest) key. The trouble with both of these obvious approaches is that they 
require f l(N) steps either for insertion or deletion, when there are N entries in 
the list, so they are very time-consuming when N is large. 

In his original paper on heapsorting, Williams pointed out that heaps are 
ideally suited to large priority queue applications, since we can insert or delete 
elements from a heap in 0(\ogN) steps; furthermore, all elements of the heap 
are compactly located in consecutive memory locations. The selection phase of 
Algorithm H is a sequence of deletion steps of a largest-in-first-out process: To 
delete the largest element Ki we remove it and sift Kn up into a new heap of 
N — 1 elements. (If we want a smallest-in-first-out algorithm, as in the elevator 
simulation, we can obviously change the definition of heap so that “>” becomes 
u <” in ( 3 ); for convenience, we shall consider only the largest-in-first-out case 
here.) In general, if we want to delete the largest item and then insert a new 
element x, we can do the sift up procedure with 

/ = 1, r = N, and K = x. 

If we wish to insert an element x without a prior deletion, we can use the bottom- 
up procedure of exercise 16. 

A linked representation for priority queues. An efficient way to represent 
priority queues as linked binary trees was discovered in 1971 by Clark A. Crane 
[Technical Report STAN-CS-72-259 (Computer Science Department, Stanford 
University, 1972)]. His method requires two link fields and a small count in 
every record, but it has the following advantages over a heap: 


150 SORTING 


5.2.3 


i) When the priority queue is being treated as a stack, the insertion and 
deletion operations take a fixed time independent of the queue size. 

ii) The records never move, only the pointers change. 

iii) Two disjoint priority queues, having a total of TV elements, can easily be 
merged into a single priority queue, in only O(logTV) steps. 

Crane’s original method, slightly modified, is illustrated in Fig. 27, which 
shows a special kind of binary tree structure. Each node contains a KEY field, a 
DIST field, and two link fields LEFT and RIGHT. The DIST field is always set to 
the length of a shortest path from that node to the null link A; in other words, 
it is the distance from that node to the nearest empty subtree. If we define 
DIST (A) = 0 and KEY (A) = — oo, the KEY and DIST fields in the tree satisfy the 
following properties: 

KEY(P) > KEY (LEFT (P) ) , KEY (P) > KEY (RIGHT (P) ) ; ( 9 ) 

DIST(P) = 1 + min(DIST(LEFT(P)),DIST(RIGHT(P))); (10) 

DIST (LEFT (P) ) > DIST(RIGHTCP)) . (11) 

Relation ( 9 ) is analogous to the heap condition ( 3 ); it guarantees that the root 
of the tree has the largest key. Relation (10) is just the definition of the DIST 
fields as stated above. Relation ( 11 ) is the interesting innovation: It implies that 
a shortest path to A may always be obtained by moving to the right. We shall 
say that a binary tree with this property is a leftist tree, because it tends to lean 
so heavily to the left. 

It is clear from these definitions that DIST(P) — n implies the existence of 
at least 2” empty subtrees below P; otherwise there would be a shorter path 
from P to A. Thus, if there are TV nodes in a leftist tree, the path leading 
downward from the root towards the right contains at most [lg(IV + 1 ) J nodes. 
It is possible to insert a new node into the priority queue by traversing this path 
(see exercise 33); hence only O(logTV) steps are needed in the worst case. The 
best case occurs when the tree is linear (all RIGHT links are A), and the worst 
case occurs when the tree is perfectly balanced. 

To remove the node at the root, we simply need to merge its two subtrees. 
The operation of merging two disjoint leftist trees, pointed to respectively by 
P and Q, is conceptually simple: If KEY (P) > KEY (Q) we take P as the root 
and merge Q with P’s right subtree; then DIST(P) is updated, and LEFT(P) is 
interchanged with RIGHT (P) if necessary. A detailed description of this process 
is not difficult to devise (see exercise 33). 

Comparison of priority queue techniques. When the number of nodes, 
TV, is small, it is best to use one of the straightforward linear list methods to 
maintain a priority queue; but when TV is large, a log TV method using heaps 
or leftist trees is obviously much faster. In Section 6.2.3 we shall discuss the 
representation of linear lists as balanced trees, and this leads to a third log TV 
method suitable for priority queue implementation. It is therefore appropriate 
to compare these three techniques. 


5.2.3 


SORTING BY SELECTION 151 



We have seen that leftist tree operations tend to be slightly faster than heap 
operations, although heaps consume less memory space because they have no 
link fields. Balanced trees take about the same space as leftist trees, perhaps 
slightly less; the operations are slower than heaps, and the programming is more 
complicated, but the balanced tree structure is considerably more flexible in 
several ways. When using a heap or a leftist tree we cannot predict very easily 
what will happen to two items with equal keys; it is impossible to guarantee 
that items with equal keys will be treated in a last-in-first-out or first-in-first- 
out manner, unless the key is extended to include an additional “serial number of 
insertion” field so that no equal keys are really present. With balanced trees, on 
the other hand, we can easily stipulate consistent conventions about equal keys, 
and we can also do things such as “insert x immediately before (or after) y." 
Balanced trees are symmetrical, so that we can delete either the largest or the 
smallest element at any time, while heaps and leftist trees must be oriented 
one way or the other. (See exercise 31, however, which shows how to construct 
symmetrical heaps.) Balanced trees can be used for searching as well as for 
sorting; and we can rather quickly remove consecutive blocks of elements from 
a balanced tree. But tt(N) steps are needed in general to merge two balanced 
trees, while leftist trees can be merged in only O(loglV) steps. 

In summary, heaps use minimum memory; leftist trees are great for merging 
disjoint priority queues; and the flexibility of balanced trees is available, if 
necessary, at reasonable cost. 






152 SORTING 


5.2.3 


/^\ Many new ways to represent priority queues have been discovered since the 
jl pioneering work of Williams and Crane discussed above. Programmers now 
have a large menu of options to ponder, besides simple lists, heaps, leftist or 
balanced trees: 

• stratified trees, which provide symmetrical priority queue operations in only 
O (log log M) steps when all keys lie in a given range 0 < K < M [P. van 
Emde Boas, R. Kaas, and E. Zijlstra, Math. Systems Theory 10 (1977), 
99-127]; 

• binomial queues [J. Vuillemin, CACM 21 (1978), 309-315; M. R. Brown, 
SICOMP 7 (1978), 298-319]; 

• pagodas [J. Frangon, G. Viennot, and J. Vuillemin, FOCS 19 (1978), 1-7]; 

• pairing heaps [M. L. Fredman, R. Sedgewick, D. D. Sleator, and R. E. Tarjan, 
Algorithmica 1 (1986), 111-129; J. T. Stasko and J. S. Vitter, CACM 30 
(1987), 234-249; M. L. Fredman, JACM 46 (1999), 473-501]; 

• skew heaps [D. D. Sleator and R. E. Tarjan, SICOMP 15 (1986), 52-59]; 

• Fibonacci heaps [M. L. Fredman and R. E. Tarjan, JACM 34 (1987), 596-615] 
and the more general AF-heaps [M. L. Fredman and D. E. Willard, 
J. Computer and System Sci. 48 (1994), 533-551]; 

• calendar queues [R. Brown, CACM 31 (1988), 1220-1227; G. A. Davison, 
CACM 32 (1989), 1241-1243]; 

• relaxed heaps [J. R. Driscoll, H. N. Gabow, R. Shrairman, and R. E. Tarjan, 
CACM 31 (1988), 1343-1354]; 

• fishspear [M. J. Fischer and M. S. Paterson, JACM 41 (1994), 3-30]; 

• hot queues [B. V. Cherkassky, A. V. Goldberg, and C. Silverstein, SICOMP 
28 (1999), 1326-1346]; 

etc. Not all of these methods will survive the test of time; leftist trees are in fact 
already obsolete, except for applications with a strong tendency towards last-in- 
first-out behavior. Detailed implementations and expositions of binomial queues 
and Fibonacci heaps can be found in D. E. Knuth, The Stanford GraphBase 
(New York: ACM Press, 1994), 475-489. 


* Analysis of heapsort. Algorithm H is rather complicated, so it probably will 
never submit to a complete mathematical analysis; but several of its properties 
can be deduced without great difficulty. Therefore we shall conclude this section 
by studying the anatomy of a heap in some detail. 

Figure 28 shows the shape of a heap with 26 elements; each node has been 
labeled in binary notation corresponding to its subscript in the heap. Asterisks 
in this diagram denote the special nodes, those that lie on the path from 1 to N. 

One of the most important attributes of a heap is the collection of its subtree 
sizes. For example, in Fig. 28 the sizes of the subtrees rooted at 1, 2, . . . , 26 are, 
respectively, 


26*, 15, 10*, 7, 7, 6*, 3, 3, 3, 3, 3, 3, 2*, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1*.    (12) 


Asterisks denote special subtrees, rooted at the special nodes; exercise 20 shows 
that if the binary representation of N is 

    N = (b_n b_{n-1} ... b_1 b_0)_2,    n = ⌊lg N⌋,    (13) 





Fig. 28. A heap of 26 = (11010)_2 elements looks like this. 


then the special subtree sizes are always 

    (1 b_{n-1} ... b_1 b_0)_2,  (1 b_{n-2} ... b_1 b_0)_2,  ...,  (1 b_1 b_0)_2,  (1 b_0)_2,  (1)_2.    (14) 


Nonspecial subtrees are always perfectly balanced, so their size is always of the 
form 2^k − 1. Exercise 21 shows that the nonspecial sizes consist of exactly 

    ⌊(N−1)/2⌋ 1s,   ⌊(N−2)/4⌋ 3s,   ⌊(N−4)/8⌋ 7s,   ...,   ⌊(N−2^{n−1})/2^n⌋ (2^n − 1)s.    (15) 


For example, Fig. 28 contains twelve nonspecial subtrees of size 1, six of size 3, 
two of size 7, and one of size 15. 

Let s_l be the size of the subtree whose root is l, and let M_N be the multiset 
{s_1, s_2, ..., s_N} of all these sizes. We can calculate M_N easily for any given N 
by using (14) and (15). Exercise 5.1.4-20 tells us that the total number of ways 
to arrange the integers {1, 2, ..., N} into a heap is 

    N!/(s_1 s_2 ... s_N) = N! / ∏{s | s ∈ M_N}.    (16) 

For example, the number of ways to place the 26 letters {A, B, C, ..., Z} into 
Fig. 28 so that vertical lines preserve alphabetic order is 

    26!/(26 · 10 · 6 · 2 · 1 · 1^12 · 3^6 · 7^2 · 15^1). 
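Formulas (14)-(16) are easy to mechanize. Here, for instance, is a small Python sketch (our own illustration, not part of the text) that constructs the multiset M_N and evaluates (16):

    # Builds M_N from (14) and (15), then counts heaps via (16).
    from math import factorial, prod

    def heap_count(n):
        s = bin(n)[2:]                        # N = (b_n ... b_0)_2, as in (13)
        special = [int('1' + s[k+1:], 2) for k in range(len(s))]      # (14)
        nonspecial = []
        for k in range(1, len(s)):            # sizes 2^k - 1, counts from (15)
            nonspecial += [2**k - 1] * ((n - 2**(k-1)) // 2**k)
        m = special + nonspecial              # the multiset M_N
        return factorial(n) // prod(m)        # formula (16)

For n = 26 this reproduces the product displayed above.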


We are now in a position to analyze the heap-creation phase of Algorithm H, 
namely the computations that take place before the condition l = 1 occurs for 
the first time in step H2. Fortunately we can reduce the study of heap creation 
to the study of independent siftup operations, because of the following theorem. 

Theorem H. If Algorithm H is applied to a random permutation of {1, 2, ..., N}, 
each of the N!/∏{s | s ∈ M_N} possible heaps is an equally likely outcome of the 
heap-creation phase. Moreover, each of the ⌊N/2⌋ siftup operations performed 
during this phase is uniform, in the sense that each of the s_l possible values of i 
is equally likely when step H8 is reached. 




Proof. We can apply what numerical analysts might call a "backwards analysis": 
given a possible result K_1 ... K_N of the siftup operation rooted at l, we see that 
there are exactly s_l prior configurations K'_1 ... K'_N of the file that will sift up 
to that result. Each of these prior configurations has a different value of K'_l; 
hence, working backwards, there are exactly s_l s_{l+1} ... s_N input permutations of 
{1, 2, ..., N} that yield the configuration K_1 ... K_N after the siftup at position l 
has been completed. 

The case l = 1 is typical: Let K_1 ... K_N be a heap, and let K'_1 ... K'_N be 
a file that is transformed by siftup into K_1 ... K_N when l = 1, K = K'_1. If 
K = K_i, we must have K'_i = K_{⌊i/2⌋}, K'_{⌊i/2⌋} = K_{⌊i/4⌋}, etc., while K'_j = K_j 
for all j not on the path from 1 to i. Conversely, for each i this construction 
yields a file K'_1 ... K'_N such that (a) siftup transforms K'_1 ... K'_N into K_1 ... K_N, 
and (b) K'_{⌊j/2⌋} ≥ K'_j for 2 ≤ ⌊j/2⌋ < j ≤ N. Therefore exactly N such files 
K'_1 ... K'_N are possible, and the siftup operation is uniform. (An example of the 
proof of this theorem appears in exercise 22.) | 

Referring to the quantities A, B, C, D in the analysis of Program H, we can 
see that a uniform siftup operation on a subtree of size s contributes ⌊s/2⌋/s to 
the average value of A; it contributes 

    (1/s)(0 + 1 + 1 + 2 + ... + ⌊lg s⌋) = (1/s) Σ_{k=1}^{s} ⌊lg k⌋ = (1/s)((s+1)⌊lg s⌋ − 2^{⌊lg s⌋+1} + 2) 

to the average value of B (see exercise 1.2.4-42); and it contributes either 2/s or 
0 to the average value of D, according as s is even or odd. The corresponding 
contribution to C is somewhat more difficult to determine, so it has been left to 
the reader (see exercise 26). Summing over all siftups, we find that the average 
value of A during heap creation is 

    A'_N = Σ{⌊s/2⌋/s | s ∈ M_N},    (17) 

and similar formulas hold for B, C, and D. It is therefore possible to compute 
these average values exactly without great difficulty, and the following table 
shows typical results: 


      N      A'_N      B'_N      C'_N     D'_N 

     99     19.18     68.35     42.95     0.00 
    100     19.93     69.39     42.71     1.84 
    999    196.16    734.66    464.53     0.00 
   1000    196.94    735.80    464.16     1.92 
   9999   1966.02   7428.18   4695.54     0.00 
  10000   1966.82   7429.39   4695.06     1.97 
  10001   1966.45   7430.07   4695.84     0.00 
  10002   1967.15   7430.97   4695.95     1.73 


Asymptotically speaking, we may ignore the special subtree sizes in M_N, and we 
find for example that 

    A'_N = (N/2)(0/1) + (N/4)(1/3) + (N/8)(3/7) + ... = (1 − α/2) N + O(log N),    (18) 




where 

    α = Σ_{k≥1} 1/(2^k − 1) = 1.60669 51524 15291 76378 33015 23190 92458 04806....    (19) 

(This value was first computed to high precision by J. W. Wrench, Jr., using the 
series transformation of exercise 27. Paul Erdős has proved that α is irrational 
[J. Indian Math. Soc. 12 (1948), 63-66], and Peter Borwein has demonstrated 
the irrationality of many similar constants [Proc. Camb. Phil. Soc. 112 (1992), 
141-146].) For large N, we may use the approximate formulas 

    A'_N ≈ 0.1967N + (−1)^N 0.3; 
    B'_N ≈ 0.74403N − 1.3 ln N; 
    C'_N ≈ 0.47034N − 0.8 ln N;                (20) 
    D'_N ≈ (1.8 ± 0.2)[N even]. 

The minimum and maximum values are also readily determined. Only O(N) 
steps are needed to create the heap (see exercise 23). 
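The exact averages are equally easy to compute by machine. The following Python sketch (ours, not from the text; it uses the per-siftup contributions derived above) reproduces rows of the table:

    # Exact averages A'_N, B'_N, D'_N of heap creation, summed over M_N.
    def creation_averages(n):
        s = bin(n)[2:]
        m = [int('1' + s[k+1:], 2) for k in range(len(s))]        # (14)
        for k in range(1, len(s)):                                # (15)
            m += [2**k - 1] * ((n - 2**(k-1)) // 2**k)
        A = sum((x // 2) / x for x in m)                          # (17)
        B = sum(((x + 1) * (x.bit_length() - 1)      # (s+1)*floor(lg s)
                 - 2 ** x.bit_length() + 2) / x      # - 2^(floor(lg s)+1) + 2
                for x in m)
        D = sum(2 / x for x in m if x % 2 == 0)      # 2/s for even s only
        return A, B, D

    # creation_averages(100) yields approximately (19.93, 69.39, 1.84),
    # in agreement with the table above.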

This theory nicely explains the heap-creation phase of Algorithm H. But 
the selection phase is another story, which remains to be written! Let A''_N, B''_N, 
C''_N, and D''_N denote the average values of A, B, C, and D during the selection 
phase when N elements are being heapsorted. The behavior of Algorithm H on 
random input is subject to comparatively little fluctuation about the empirically 
determined average values 

    A''_N ≈ 0.152N; 
    B''_N ≈ N lg N − 2.61N; 
    C''_N ≈ ½ N lg N − 1.41N;                (21) 
    D''_N ≈ lg N ± 2; 

but no adequate theoretical explanation for the behavior of D''_N or for the 
conjectured constants 0.152, 2.61, or 1.41 has yet been found. The leading 
terms of B''_N and C''_N have, however, been established in an elegant manner by 
R. Schaffer and R. Sedgewick; see exercise 30. Schaffer has also proved that the 
minimum and maximum possible values of C are respectively asymptotic to 
¼ N lg N and ½ N lg N. 

EXERCISES 

1. [10] Is straight selection (Algorithm S) a stable sorting method? 

2. [15] Why does it prove to be more convenient to select the largest key, then 
the second-largest, etc., in Algorithm S, instead of first finding the smallest, then the 
second-smallest, etc.? 

3. [M21] (a) Prove that if the input to Algorithm S is a random permutation of 
{1, 2, ..., N}, then the first iteration of steps S2 and S3 yields a random permutation 
of {1, 2, ..., N−1} followed by N. (In other words, the presence of each permutation 
of {1, 2, ..., N−1} in K_1 ... K_{N−1} is equally likely.) (b) Therefore if B_N denotes the 
average value of the quantity B in Program S, given randomly ordered input, we have 
B_N = H_N − 1 + B_{N−1}. [Hint: See Eq. 1.2.10-(16).] 

► 4. [M25] Step S3 of Algorithm S accomplishes nothing when i = j; is it a good idea 
to test whether or not i = j before doing step S3? What is the average number of 
times the condition i = j will occur in step S3 for random input? 

5. [20] What is the value of the quantity B in the analysis of Program S, when the 
input is N ... 3 2 1? 

6. [M29] (a) Let a_1 a_2 ... a_N be a permutation of {1, 2, ..., N} having C cycles, 
I inversions, and B changes to the right-to-left maxima when sorted by Program S. 
Prove that 2B ≤ I + N − C. [Hint: See exercise 5.2.2-1.] (b) Show that I + N − C ≤ 
⌊N²/2⌋; hence B can never exceed ⌊N²/4⌋. 

7. [Mil] Find the variance of the quantity B in Program S, as a function of N, 
assuming random input. 

► 8. [24] Show that if the search for max(K_1, ..., K_j) in step S2 is carried out by 
examining keys in left-to-right order K_1, K_2, ..., K_j, instead of going from right to 
left as in Program S, it is often possible to reduce the number of comparisons needed 
on the next iteration of step S2. Write a MIX program based on this observation. 

9. [M25] What is the average number of comparisons performed by the algorithm 
of exercise 8, for random input? 

10. [12] What will be the configuration of the tree in Fig. 23 after 14 of the original 
16 items have been output? 

11. [10] What will be the configuration of the tree in Fig. 24 after the element 908 
has been output? 

12. [M20] How many times will −∞ be compared with −∞ when the bottom-up 
method of Fig. 23 is used to sort a file of 2^n elements into order? 

13. [20] (J. W. J. Williams.) Step H4 of Algorithm H distinguishes between the three 
cases j < r, j = r, and j > r. Show that if K ≥ K_{r+1} it would be possible to simplify 
step H4 so that only a two-way branch is made. How could the condition K ≥ K_{r+1} 
be ensured throughout the heapsort process, by modifying step H2? 

14. [10] Show that simple queues are special cases of priority queues. (Explain how 
keys can be assigned to the elements so that a largest-in-first-out procedure is equivalent 
to first-in-first-out.) Is a stack also a special case of a priority queue? 

► 15. [M22] (B. A. Chartres.) Design a high-speed algorithm that builds a table of 
the prime numbers < N, making use of a priority queue to avoid division operations. 
[Hint: Let the smallest key in the priority queue be the least odd nonprime number 
greater than the last odd number considered as a prime candidate. Try to minimize 
the number of elements in the queue.] 

16. [20] Design an efficient algorithm that inserts a new key into a given heap of 
n elements, producing a heap of n + 1 elements. 

17. [20] The algorithm of exercise 16 can be used for heap creation, instead of the 
“decrease l to 1” method used in Algorithm H. Do both methods create the same heap 
when they begin with the same input file? 

► 18. [21] (R. W. Floyd.) During the selection phase of heapsort, the key K tends to 
be quite small, so that nearly all of the comparisons in step H6 find K < K_j. Show 
how to modify the algorithm so that K is not compared with K_j in the main loop of 
the computation, thereby nearly cutting the average number of comparisons in half. 




19. [21] Design an algorithm that deletes a given element of a heap of length N, 
producing a heap of length N − 1. 

20. [M20] Prove that (14) gives the special subtree sizes in a heap. 

21. [M24] Prove that (15) gives the nonspecial subtree sizes in a heap. 

► 22. [20] What permutations of {1, 2, 3, 4, 5} are transformed into 5 3 4 1 2 by the heap- 
creation phase of Algorithm H? 

23. [M28] (a) Prove that the length of scan, B, in a siftup algorithm never exceeds 
⌊lg(r/l)⌋. (b) According to (a), B can never exceed N⌊lg N⌋ in any particular appli- 
cation of Algorithm H. Find the maximum value of B as a function of N, taken over 
all possible input files. (You must prove that an input file exists such that B takes on 
this maximum value.) 

24. [M24] Derive an exact formula for the standard deviation of B'_N (the total length 
of scan during the heap-creation phase of Algorithm H). 

25. [M20] What is the average value of the contribution to C made during the siftup 
pass when l = 1 and r = N, if N = 2^{n+1} − 1? 

26. [M30] Solve exercise 25, (a) for N = 26, (b) for general N. 

27. [M25] (T. Clausen, 1828.) Prove that 

    Σ_{n≥1} x^n/(1 − x^n) = Σ_{n≥1} ((1 + x^n)/(1 − x^n)) x^{n²}. 

(Setting x = ½ gives a very rapidly converging series for the evaluation of (19).) 

28. [35] Explore the idea of ternary heaps, based on complete ternary trees instead 
of binary trees. Do ternary heaps sort faster than binary heaps? 

29. [26] (W. S. Brown.) Design an algorithm for multiplication of polynomials or 
power series (a_1 x^{i_1} + a_2 x^{i_2} + ...)(b_1 x^{j_1} + b_2 x^{j_2} + ...), in which the coefficients of 
the answer c_1 x^{i_1+j_1} + ... are generated in order as the input coefficients are being 
multiplied. [Hint: Use an appropriate priority queue.] 

► 30. [HM35] (R. Schaffer and R. Sedgewick.) Let h_{nm} be the number of heaps on 
the elements {1, 2, ..., n} for which the selection phase of heapsort does exactly m 
promotions. Prove that h_{nm} ≤ 2^m ∏_{k=2}^{n} lg k, and use this relation to show that the 
average number of promotions performed by Algorithm H is N lg N + O(N log log N). 

31. [37] (J. W. J. Williams.) Show that if two heaps are placed "back to back" in a 
suitable way, it is possible to maintain a structure in which either the smallest or the 
largest element can be deleted at any time in O(log n) steps. (Such a structure may be 
called a priority deque.) 

32. [M28] Prove that the number of heapsort promotions, B, is always at least 
½ N lg N + O(N), if the keys being sorted are distinct. [Hint: Consider the movement 
of the largest ⌈N/2⌉ keys.] 

33. [21] Design an algorithm that merges two disjoint priority queues, represented 
as leftist trees, into one. (In particular, if one of the given queues contains a single 
element, your algorithm will insert it into the other queue.) 

34. [M41] How many leftist trees with N nodes are possible, ignoring the KEY values? 
The sequence begins 1, 1, 2, 4, 8, 17, 38, 87, 203, 482, 1160, ...; show that the number 
is asymptotically a b^N N^{−3/2} for suitable constants a and b, using techniques like those 
of exercise 2.3.4.4-4. 


158 SORTING 


5.2.3 


35. [26] If UP links are added to a leftist tree (see the discussion of triply linked trees in 
Section 6.2.3), it is possible to delete an arbitrary node P from within the priority queue 
as follows: Replace P by the merger of LEFT(P) and RIGHT(P); then adjust the DIST 
fields of P's ancestors, possibly swapping left and right subtrees, until either reaching 
the root or reaching a node whose DIST is unchanged. 

Prove that this process never requires changing more than O(log N) of the DIST 
fields, if there are N nodes in the tree, even though the tree may contain very long 
upward paths. 

36. [18] (Least-recently-used page replacement.) Many operating systems make use of 
the following type of algorithm: A collection of nodes is subjected to two operations, 
(i) "using" a node, and (ii) replacing the least-recently-used node by a new node. What 
data structure makes it easy to ascertain the least-recently-used node? 

37. [HM32] Let e_N(k) be the expected treewise distance of the kth-largest element 
from the root, in a random heap of N elements, and let e(k) = lim_{N→∞} e_N(k). Thus 
e(1) = 0, e(2) = 1, e(3) = 1.5, and e(4) = 1.875. Find the asymptotic value of e(k) to 
within O(k^{−1}). 

38. [M21] Find a simple recurrence relation for the multiset M_N of subtree sizes in a 
heap or in a complete binary tree with N internal nodes. 

5.2.4. Sorting by Merging 

Merging (or collating) means the combination of two or more ordered files into 
a single ordered file. For example, we can merge the two files 503 703 765 and 
087 512 677 to obtain 087 503 512 677 703 765. A simple way to accomplish this 
is to compare the two smallest items, output the smallest, and then repeat the 
same process. Starting with 

    { 503 703 765 
    { 087 512 677 

we obtain 

    087 { 503 703 765 
        { 512 677 

then 

    087 503 { 703 765 
            { 512 677 

and 

    087 503 512 { 703 765 
                { 677 

and so on. Some care is necessary when one of the two files becomes exhausted; 
a detailed description of the process appears in the following algorithm: 

Algorithm M (Two-way merge). This algorithm merges nonempty ordered files 
x_1 ≤ x_2 ≤ ... ≤ x_m and y_1 ≤ y_2 ≤ ... ≤ y_n into a single file z_1 ≤ z_2 ≤ ... ≤ z_{m+n}. 

M1. [Initialize.] Set i ← 1, j ← 1, k ← 1. 

M2. [Find smaller.] If x_i ≤ y_j, go to step M3; otherwise go to M5. 







Fig. 29. Merging x_1 ≤ ... ≤ x_m with y_1 ≤ ... ≤ y_n. 


M3. [Output x_i.] Set z_k ← x_i, k ← k + 1, i ← i + 1. If i ≤ m, return to M2. 

M4. [Transmit y_j, ..., y_n.] Set (z_k, ..., z_{m+n}) ← (y_j, ..., y_n) and terminate the 
algorithm. 

M5. [Output y_j.] Set z_k ← y_j, k ← k + 1, j ← j + 1. If j ≤ n, return to M2. 

M6. [Transmit x_i, ..., x_m.] Set (z_k, ..., z_{m+n}) ← (x_i, ..., x_m) and terminate 
the algorithm. | 
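In Python the algorithm might be sketched as follows (zero-origin indexing; an illustration rather than a transcription of the MIX conventions):

    # Two-way merge in the manner of Algorithm M.
    def merge(x, y):
        z, i, j = [], 0, 0
        while i < len(x) and j < len(y):   # M2: find smaller
            if x[i] <= y[j]:
                z.append(x[i]); i += 1     # M3: output x_i
            else:
                z.append(y[j]); j += 1     # M5: output y_j
        z.extend(x[i:])                    # M6: transmit x_i, ..., x_m
        z.extend(y[j:])                    # M4: transmit y_j, ..., y_n
        return z

    # merge([503, 703, 765], [87, 512, 677])
    #   == [87, 503, 512, 677, 703, 765]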

We shall see in Section 5.3.2 that this straightforward procedure is essentially 
the best possible way to merge on a conventional computer, when m ≈ n. (On 
the other hand, when m is much smaller than n, it is possible to devise more 
efficient merging algorithms, although they are rather complicated in general.) 
Algorithm M could be made slightly simpler without much loss of efficiency by 
placing sentinel elements x_{m+1} = y_{n+1} = ∞ at the end of the input files, stopping 
just before ∞ is output. For an analysis of Algorithm M, see exercise 2. 

The total amount of work involved in Algorithm M is essentially propor- 
tional to m + n, so it is clear that merging is a simpler problem than sorting. 
Furthermore, we can reduce the problem of sorting to merging, because we can 
repeatedly merge longer and longer subfiles until everything is in sort. We may 
consider this to be an extension of the idea of insertion sorting: Inserting a new 
element into a sorted file is the special case n = 1 of merging. If we want to 
speed up the insertion process we can consider inserting several elements at a 
time, “batching” them, and this leads naturally to the general idea of merge 
sorting. From a historical point of view, merge sorting was one of the very first 
methods proposed for computer sorting; it was suggested by John von Neumann 
as early as 1945 (see Section 5.5). 

We shall study merging in considerable detail in Section 5.4, with regard 
to external sorting algorithms; our main concern in the present section is the 
somewhat simpler question of merge sorting within a high-speed random-access 
memory. 

Table 1 shows a merge sort that "burns the candle at both ends" in a manner 
similar to the scanning procedure we have used in quicksort and radix exchange: 
We examine the input from the left and from the right, working towards the 
middle. Ignoring the top line of the table for a moment, let us consider the 
transformation from line 2 to line 3. At the left we have the ascending run 503 
703 765; at the right, reading leftwards, we have the run 087 512 677. Merging 
these two sequences leads to 087 503 512 677 703 765, which is placed at the 
left of line 3. Then the keys 061 612 908 in line 2 are merged with 170 509 897, 
and the result (061 170 509 612 897 908) is recorded at the right end of line 3. 
Finally, 154 275 426 653 is merged with 653 — discovering the overlap before it 
causes any harm — and the result is placed at the left, following the previous run. 
Line 2 of the table was formed in the same way from the original input in line 1 . 


Table 1 

NATURAL TWO-WAY MERGE SORTING 

503 | 087 512 | 061 908 | 170 897 | 275 653 | 426 | 154 509 612 677 | 765 703 
503 703 765 | 061 612 908 | 154 275 426 653 | 897 509 170 | 677 512 087 
087 503 512 677 703 765 | 154 275 426 653 | 908 897 612 509 170 061 
061 087 170 503 509 512 612 677 703 765 897 908 | 653 426 275 154 
061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908 


Vertical lines in Table 1 represent the boundaries between runs. They are the 
so-called stepdowns, where a smaller element follows a larger one in the direction 
of reading. We generally encounter an ambiguous situation in the middle of the 
file, when we read the same key from both directions; this causes no problem if we 
are a little bit careful as in the following algorithm. The method is traditionally 
called a “natural” merge because it makes use of the runs that occur naturally 
in its input. 

Algorithm N (Natural two-way merge sort). Records R_1, ..., R_N are sorted 
using two areas of memory, each of which is capable of holding N records. For 
convenience, we shall say that the records of the second area are R_{N+1}, ..., R_{2N}, 
although it is not really necessary that R_{N+1} be adjacent to R_N. The initial 
contents of R_{N+1}, ..., R_{2N} are immaterial. After sorting is complete, the keys 
will be in order, K_1 ≤ ... ≤ K_N. 

N1. [Initialize.] Set s ← 0. (When s = 0, we will be transferring records from 
the (R_1, ..., R_N) area to the (R_{N+1}, ..., R_{2N}) area; when s = 1, we will 
be going the other way.) 

N2. [Prepare for pass.] If s = 0, set i ← 1, j ← N, k ← N + 1, l ← 2N; if 
s = 1, set i ← N + 1, j ← 2N, k ← 1, l ← N. (Variables i, j, k, l point to 
the current positions in the "source files" being read and the "destination 
files" being written.) Set d ← 1, f ← 1. (Variable d gives the current 
direction of output; f is set to zero if future passes are necessary.) 

N3. [Compare K_i : K_j.] If K_i > K_j, go to step N8. If i = j, set R_k ← R_i and 
go to N13. 





Fig. 30. Merge sorting. 


N4. [Transmit R_i.] (Steps N4-N7 are analogous to steps M3-M4 of Algo- 
rithm M.) Set R_k ← R_i, k ← k + d. 

N5. [Stepdown?] Increase i by 1. Then if K_{i−1} ≤ K_i, go back to step N3. 

N6. [Transmit R_j.] Set R_k ← R_j, k ← k + d. 

N7. [Stepdown?] Decrease j by 1. Then if K_{j+1} ≤ K_j, go back to step N6; 
otherwise go to step N12. 

N8. [Transmit R_j.] (Steps N8-N11 are dual to steps N4-N7.) Set R_k ← R_j, 
k ← k + d. 

N9. [Stepdown?] Decrease j by 1. Then if K_{j+1} ≤ K_j, go back to step N3. 

N10. [Transmit R_i.] Set R_k ← R_i, k ← k + d. 

N11. [Stepdown?] Increase i by 1. Then if K_{i−1} ≤ K_i, go back to step N10. 

N12. [Switch sides.] Set f ← 0, d ← −d, and interchange k ↔ l. Return to 
step N3. 

N13. [Switch areas.] If f = 0, set s ← 1 − s and return to N2. Otherwise sorting 
is complete; if s = 0, set (R_1, ..., R_N) ← (R_{N+1}, ..., R_{2N}). (This last 
copying operation is unnecessary if it is acceptable to have the output in 
(R_{N+1}, ..., R_{2N}) about half of the time.) | 

This algorithm contains one tricky feature that is explained in exercise 5. 
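The essential idea of Algorithm N, stripped of its two-directional bookkeeping, can be sketched briefly in Python (a simplified illustration only: the genuine Algorithm N reads from both ends of the area and alternates the direction of output, which is not reproduced here; merge() is the function from the Algorithm M sketch above):

    # Simplified natural merge sort: split into ascending runs, then
    # merge pairs of runs repeatedly.
    def ascending_runs(a):
        out, start = [], 0
        for i in range(1, len(a) + 1):
            if i == len(a) or a[i] < a[i - 1]:   # a stepdown ends a run
                out.append(a[start:i]); start = i
        return out

    def natural_merge_sort(a):
        lists = ascending_runs(a)
        while len(lists) > 1:                    # each pass halves the runs
            merged = [merge(p, q) for p, q in zip(lists[::2], lists[1::2])]
            if len(lists) % 2:
                merged.append(lists[-1])         # odd run carried over
            lists = merged
        return lists[0] if lists else []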

It would not be difficult to program Algorithm N for MIX, but we can 
deduce the essential facts of its behavior without constructing the entire program. 
The number of ascending runs in the input will be about ½N, under random 
conditions, since we have K_i > K_{i+1} with probability ½; detailed information 
about the number of runs, under slightly different hypotheses, has been derived 
in Section 5.1.3. Each pass cuts the number of runs in half (except in unusual 
cases such as the situation in exercise 6). So the number of passes will usually be 
about lg(N/2) = lg N − 1. Each pass requires us to transmit each of the N records, 
and by exercise 2 most of the time is spent in steps N3, N4, N5, N8, N9. We 
can sketch the time in the inner loop as follows, if we assume that there is low 
probability of equal keys: 

             Step   Operations                Time 
             N3     CMPA, JG, JE              3.5u 
    Either   N4     STA, INC                  3u 
             N5     INC, LDA, CMPA, JGE       6u 
    or       N8     STX, INC                  3u 
             N9     DEC, LDX, CMPX, JGE       6u 


Thus about 12.5u is spent on each record in each pass, and the total running 
time will be asymptotically 12.5 N lg N, for both the average case and the worst 
case. This is slower than quicksort's average time, and it may not be enough 
better than heapsort to justify taking twice as much memory space, since the 
asymptotic running time of Program 5.2.3H is never more than 18N lg N. 

The boundary lines between runs are determined in Algorithm N entirely by 
stepdowns. This has the possible advantage that input files with a preponderance 
of increasing order can be handled very quickly, and so can input files with 
a preponderance of decreasing order; but it slows down the main loop of the 
calculation. Instead of testing stepdowns, we can determine the length of runs 
artificially, by saying that all runs in the input have length 1, all runs after the 
first pass (except possibly the last run) have length 2, ..., all runs after k passes 
(except possibly the last run) have length 2^k. This is called a straight two-way 
merge, as opposed to the "natural" merge in Algorithm N. 

Straight two-way merging is very similar to Algorithm N, and it has essen- 
tially the same flow chart; but things are sufficiently different that we had better 
write down the whole algorithm again: 

Algorithm S (Straight two-way merge sort). Records R_1, ..., R_N are sorted 
using two memory areas as in Algorithm N. 

S1. [Initialize.] Set s ← 0, p ← 1. (For the significance of variables s, i, j, k, 
l, and d, see Algorithm N. Here p represents the size of ascending runs to 
be merged on the current pass; further variables q and r will keep track of 
the number of unmerged items in a run.) 

S2. [Prepare for pass.] If s = 0, set i ← 1, j ← N, k ← N, l ← 2N + 1; if s = 1, 
set i ← N + 1, j ← 2N, k ← 0, l ← N + 1. Then set d ← 1, q ← p, r ← p. 

S3. [Compare K_i : K_j.] If K_i > K_j, go to step S8. 

S4. [Transmit R_i.] Set k ← k + d, R_k ← R_i. 

S5. [End of run?] Set i ← i + 1, q ← q − 1. If q > 0, go back to step S3. 

S6. [Transmit R_j.] Set k ← k + d. Then if k = l, go to step S13; otherwise set 
R_k ← R_j. 




Table 2 

STRAIGHT TWO-WAY MERGE SORTING 

503 | 087 | 512 | 061 | 908 | 170 | 897 | 275 | 653 | 426 | 154 | 509 | 612 | 677 | 765 | 703 
503 703 | 512 677 | 509 908 | 426 897 | 653 275 | 170 154 | 612 061 | 765 087 
087 503 703 765 | 154 170 509 908 | 897 653 426 275 | 677 612 512 061 
061 087 503 512 612 677 703 765 | 908 897 653 509 426 275 170 154 
061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908 


S7. [End of run?] Set j ← j − 1, r ← r − 1. If r > 0, go back to step S6; 
otherwise go to S12. 

S8. [Transmit R_j.] Set k ← k + d, R_k ← R_j. 

S9. [End of run?] Set j ← j − 1, r ← r − 1. If r > 0, go back to step S3. 

S10. [Transmit R_i.] Set k ← k + d. Then if k = l, go to step S13; otherwise set 
R_k ← R_i. 

S11. [End of run?] Set i ← i + 1, q ← q − 1. If q > 0, go back to step S10. 

S12. [Switch sides.] Set q ← p, r ← p, d ← −d, and interchange k ↔ l. If 
j − i < p, return to step S10; otherwise return to S3. 

S13. [Switch areas.] Set p ← p + p. If p < N, set s ← 1 − s and return to S2. 
Otherwise sorting is complete; if s = 0, set 

    (R_1, ..., R_N) ← (R_{N+1}, ..., R_{2N}). 

(The latter copying operation will be done if and only if ⌈lg N⌉ is odd, or in 
the trivial case N = 1, regardless of the distribution of the input. Therefore 
it is possible to predict the location of the sorted output in advance, and 
copying will usually be unnecessary.) | 
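A bottom-up Python sketch of the same idea (again a simplification: the alternating output direction and the two fixed areas of Algorithm S are not modeled; merge() is the Algorithm M sketch above) shows how the run length p doubles on each pass:

    # Straight two-way merge sort.
    def straight_merge_sort(a):
        b, p = list(a), 1                     # p = current run length, as in S1
        while p < len(b):
            out = []
            for lo in range(0, len(b), 2 * p):
                out += merge(b[lo:lo + p], b[lo + p:lo + 2 * p])
            b = out
            p += p                            # step S13: p <- p + p
        return b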

An example of this algorithm appears in Table 2. It is somewhat amazing 
that the method works properly when N is not a power of 2; the runs being 
merged are not all of length 2^k, yet no provision has apparently been made for 
the exceptions! (See exercise 8.) The former tests for stepdowns have been 
replaced by decrementing q or r and testing the result for zero; this reduces the 
asymptotic MIX running time to 11N lg N units, slightly faster than we were able 
to achieve with Algorithm N. 

In practice it would be worthwhile to combine Algorithm S with straight 
insertion; we can sort groups of, say, 16 items using straight insertion, in place of 
the first four passes of Algorithm S, thereby avoiding the comparatively wasteful 
bookkeeping operations involved in short merges. As we saw with quicksort, 
such a combination of methods does not affect the asymptotic running time, but 
it gives us a reasonable improvement nevertheless. 

Let us now study Algorithms N and S from the standpoint of data structures. 
Why did we need 2N record locations instead of N? The reason is comparatively 
simple: We were dealing with four lists of varying size (two source lists and 
two destination lists on each pass); and we were using the standard "growing 
together" idea discussed in Section 2.2.2, for each pair of sequentially allocated 
lists. But half of the memory space was always unused, and a little reflection 
shows that we could really make use of a linked allocation for the four lists. If 
we add one link field to each of the N records, we can do everything required 
by the merging algorithms using simple link manipulations, without moving the 
records at all! Adding N link fields is generally better than adding the space 
needed for N more records, and the reduced record movement may also save 
us time, unless our computer memory is especially good at sequential reading 
and writing. Therefore we ought to consider also a merging algorithm like the 
following one: 

Algorithm L (List merge sort). Records R_1, ..., R_N are assumed to contain 
keys K_1, ..., K_N, together with link fields L_1, ..., L_N capable of holding the 
numbers −(N + 1) through (N + 1). There are two auxiliary link fields L_0 and 
L_{N+1} in artificial records R_0 and R_{N+1} at the beginning and end of the file. This 
algorithm is a "list sort" that sets the link fields so that the records are linked 
together in ascending order. After sorting is complete, L_0 will be the index of 
the record with the smallest key; and L_k, for 1 ≤ k ≤ N, will be the index of the 
record that follows R_k, or L_k = 0 if R_k is the record with the largest key. (See 
Eq. 5.2.1-(13).) 

During the course of this algorithm, R_0 and R_{N+1} serve as list heads for two 
linear lists whose sublists are being merged. A negative link denotes the end of 
a sublist known to be ordered; a zero link denotes the end of the entire list. We 
assume that N ≥ 2. 

The notation "|L_s| ← p" means "Set L_s to p or −p, retaining the previous 
sign of L_s." This operation is well suited to MIX, but unfortunately not to most 
computers; it is possible to modify the algorithm in straightforward ways to 
obtain an equally efficient method for most other machines. 

L1. [Prepare two lists.] Set L_0 ← 1, L_{N+1} ← 2, L_i ← −(i + 2) for 1 ≤ i ≤ N − 2, 
and L_{N−1} ← L_N ← 0. (We have created two lists containing R_1, R_3, R_5, ... 
and R_2, R_4, R_6, ..., respectively; the negative links indicate that each or- 
dered sublist consists of one element only. For another way to do this step, 
taking advantage of ordering that may be present in the initial data, see 
exercise 12.) 

L2. [Begin new pass.] Set s ← 0, t ← N + 1, p ← L_s, q ← L_t. If q = 0, the 
algorithm terminates. (During each pass, p and q traverse the lists being 
merged; s usually points to the most recently processed record of the current 
sublist, while t points to the end of the previously output sublist.) 

L3. [Compare K_p : K_q.] If K_p > K_q, go to L6. 

L4. [Advance p.] Set |L_s| ← p, s ← p, p ← L_p. If p > 0, return to L3. 

L5. [Complete the sublist.] Set L_s ← q, s ← t. Then set t ← q and q ← L_q, one 
or more times, until q ≤ 0. Finally go to L8. 

L6. [Advance q.] (Steps L6 and L7 are dual to L4 and L5.) Set |L_s| ← q, s ← q, 
q ← L_q. If q > 0, return to L3. 



Table 3 

LIST MERGE SORTING 

  j     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 
 K_j    -  503  087  512  061  908  170  897  275  653  426  154  509  612  677  765  703    - 
 L_j    1   -3   -4   -5   -6   -7   -8   -9  -10  -11  -12  -13  -14  -15  -16    0    0    2 
 L_j    2   -6    1   -8    3  -10    5  -11    7  -13    9   12  -16   14    0    0   15    4 
 L_j    4    3    1  -11    2  -13    8    5    7    0   12   10    9   14   16    0   15    6 
 L_j    4    3    6    7    2    0    8    5    1   14   12   10   13    9   16    0   15   11 
 L_j    4   12   11   13    2    0    8    5   10   14    1    6    3    9   16    7   15    0 


L7. [Complete the sublist.] Set L_s ← p, s ← t. Then set t ← p and p ← L_p, one 
or more times, until p ≤ 0. 

L8. [End of pass?] (At this point, p ≤ 0 and q ≤ 0, since both pointers have 
moved to the end of their respective sublists.) Set p ← −p, q ← −q. If 
q = 0, set |L_s| ← p, |L_t| ← 0 and return to L2. Otherwise return to L3. | 
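The flavor of sorting by link manipulation can be conveyed by the following Python sketch (an illustration only: the signed-link trick and the two artificial list heads of Algorithm L are replaced here by a queue of sublist heads):

    # List merge sort on index links: keys[1..n] are the keys, link[k] is
    # the index of the record following R_k, and 0 acts as the null link
    # (link[0] doubles as a scratch list head during each merge).
    from collections import deque

    def list_merge_sort(keys):
        n = len(keys) - 1                  # keys[0] is unused
        link = [0] * (n + 1)

        def merged(p, q):                  # merge two sublists, return head
            s = 0
            while p and q:
                if keys[p] <= keys[q]:
                    link[s] = p; s = p; p = link[p]
                else:
                    link[s] = q; s = q; q = link[q]
            link[s] = p or q               # attach the leftover sublist
            return link[0]

        sublists = deque(range(1, n + 1))  # n one-element sublists
        while len(sublists) > 1:
            sublists.append(merged(sublists.popleft(), sublists.popleft()))
        head = sublists[0] if sublists else 0
        return head, link                  # index of smallest key, and links

    # head, link = list_merge_sort([None, 503, 87, 512, 61]);
    # following link[] from head then visits 61, 87, 503, 512 in order.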

An example of this algorithm in action appears in Table 3, where we can see the 
link settings each time step L2 is encountered. It is possible to rearrange the 
records R_1, ..., R_N at the end of this algorithm so that their keys are in order, 
using the method of exercise 5.2-12. There is an interesting similarity between 
list merging and the addition of sparse polynomials (see Algorithm 2.2.4A). 

Let us now construct a MIX program for Algorithm L, to see whether the 
list manipulation is advantageous from the standpoint of speed as well as space: 

Program L (List merge sort). For convenience, we assume that records are 
one word long, with L_j in the (0:2) field and K_j in the (3:5) field of location 
INPUT + j; rI1 ≡ p, rI2 ≡ q, rI3 ≡ s, rI4 ≡ t, rA ≡ K_q; N ≥ 2. 

01  L      EQU   0:2                         Definition of field names 
02  ABSL   EQU   1:2 
03  KEY    EQU   3:5 
04  START  ENT1  N-2               1         L1. Prepare two lists. 
05         ENNA  2,1               N-2 
06         STA   INPUT,1(L)        N-2       L_i ← −(i+2). 
07         DEC1  1                 N-2 
08         J1P   *-3               N-2       N−2 ≥ i > 0. 
09         ENTA  1                 1 
10         STA   INPUT(L)          1         L_0 ← 1. 
11         ENTA  2                 1 
12         STA   INPUT+N+1(L)      1         L_{N+1} ← 2. 
13         STZ   INPUT+N-1(L)      1         L_{N−1} ← 0. 
14         STZ   INPUT+N(L)        1         L_N ← 0. 
15         JMP   L2                1         To L2. 
16  L3Q    LDA   INPUT,2           C″+B′     L3. Compare K_p : K_q. 
17  L3P    CMPA  INPUT,1(KEY)      C 
18         JL    L6                C         To L6 if K_q < K_p. 
19  L4     ST1   INPUT,3(ABSL)     C′        L4. Advance p. |L_s| ← p. 
20         ENT3  0,1               C′        s ← p. 
21         LD1   INPUT,1(L)        C′        p ← L_p. 
22         J1P   L3P               C′        To L3 if p > 0. 
23  L5     ST2   INPUT,3(L)        B′        L5. Complete the sublist. L_s ← q. 
24         ENT3  0,4               B′        s ← t. 
25         ENT4  0,2               D′        t ← q. 
26         LD2   INPUT,2(L)        D′        q ← L_q. 
27         J2P   *-2               D′        Repeat if q > 0. 
28         JMP   L8                B′        To L8. 
29  L6     ST2   INPUT,3(ABSL)     C″        L6. Advance q. |L_s| ← q. 
30         ENT3  0,2               C″        s ← q. 
31         LD2   INPUT,2(L)        C″        q ← L_q. 
32         J2P   L3Q               C″        To L3 if q > 0. 
33  L7     ST1   INPUT,3(L)        B″        L7. Complete the sublist. L_s ← p. 
34         ENT3  0,4               B″        s ← t. 
35         ENT4  0,1               D″        t ← p. 
36         LD1   INPUT,1(L)        D″        p ← L_p. 
37         J1P   *-2               D″        Repeat if p > 0. 
38  L8     ENN1  0,1               B         L8. End of pass? p ← −p. 
39         ENN2  0,2               B         q ← −q. 
40         J2NZ  L3Q               B         To L3 if q ≠ 0. 
41         ST1   INPUT,3(ABSL)     A         |L_s| ← p. 
42         STZ   INPUT,4(ABSL)     A         |L_t| ← 0. 
43  L2     ENT3  0                 A+1       L2. Begin new pass. s ← 0. 
44         ENT4  N+1               A+1       t ← N+1. 
45         LD1   INPUT(L)          A+1       p ← L_s. 
46         LD2   INPUT+N+1(L)      A+1       q ← L_t. 
47         J2NZ  L3Q               A+1       To L3 if q ≠ 0.  | 


The running time of this program can be deduced using techniques we have 
seen many times before (see exercises 13 and 14); it comes to approximately 
(10N lg N + 4.92N)u on the average, with a small standard deviation of order 
√N. Exercise 15 shows that the running time can in fact be reduced to about 
(8N lg N)u, at the expense of a substantially longer program. 

Thus we have a clear victory for linked-memory techniques over sequential 
allocation, when internal merging is being done: Less memory space is required, 
and the program runs about 10 to 20 percent faster. Similar algorithms have 
been published by L. J. Woodrum [IBM Systems J. 8 (1969), 189-203] and 
A. D. Woodall [Comp. J. 13 (1970), 110-111]. 

EXERCISES 

1. [21] Generalize Algorithm M to a k-way merge of the input files x_{i1} ≤ ... ≤ x_{i m_i} 
for i = 1, 2, ..., k. 

2. [M24] Assuming that each of the (m+n choose m) possible arrangements of m x's among 
n y's is equally likely, find the mean and standard deviation of the number of times 
step M2 is performed during Algorithm M. What are the maximum and minimum 
values of this quantity? 

► 3. [20] (Updating.) Given records R_1, ..., R_M and R'_1, ..., R'_N whose keys are dis- 
tinct and in order, so that K_1 < ... < K_M and K'_1 < ... < K'_N, show how to modify 
Algorithm M to obtain a merged file in which records R_i of the first file have been 
discarded if their keys appear also in the second file. 




4. [21] The text observes that merge sorting may be regarded as a generalization 
of insertion sorting. Show that merge sorting is also strongly related to tree selection 
sorting as depicted in Fig. 23. 

► 5. [21] Prove that i can never be equal to j in steps N6 or N10. (Therefore it is 
unnecessary to test for a possible jump to N13 in those steps.) 

6. [22] Find a permutation K_1 K_2 ... K_16 of {1, 2, ..., 16} such that 

    K_2 > K_3,  K_4 > K_5,  K_6 > K_7,  K_8 > K_9,  K_10 < K_11,  K_12 < K_13,  K_14 < K_15, 

yet Algorithm N will sort the file in only two passes. (Since there are eight or more 
runs, we would expect to have at least four runs after the first pass, two runs after 
the second pass, and sorting would ordinarily not be complete until after at least three 
passes. How can we get by with only two passes?) 

7. [16] Give a formula for the exact number of passes required by Algorithm S, as a 
function of N. 

8. [22] During Algorithm S, the variables q and r are supposed to represent the 
lengths of the unmerged parts of the runs currently being processed; q and r both 
start out equal to p, although the runs are not always this long. How can this possibly 
work? 

9. [24] Write a MIX program for Algorithm S. Specify the instruction frequencies in 
terms of quantities analogous to A, B′, B″, C′, ... in Program L. 

10. [25] (D. A. Bell.) Show that sequentially allocated straight two-way merging can 
be done with at most (3/2)N memory locations, instead of 2N as in Algorithm S. 

11. [21] Is Algorithm L a stable sorting method? 

► 12. [22] Revise step L1 of Algorithm L so that the two-way merge is "natural," taking 
advantage of ascending runs that are initially present. (In particular, if the input is 
already sorted, step L2 should terminate the algorithm immediately after your step L1 
has acted.) 

► 13. [M34] Give an analysis of the average running time of Program L, in the style 
of other analyses in this chapter: Interpret the quantities A, B, B′, B″, ..., and explain 
how to compute their exact average values. How long does Program L take to sort the 
16 numbers in Table 3? 

14. [M24] Let the binary representation of N be 2^{e_1} + 2^{e_2} + ... + 2^{e_t}, where e_1 > e_2 > 
... > e_t ≥ 0, t ≥ 1. Prove that the maximum number of key comparisons performed 
by Algorithm L is 1 − 2^{e_t} + Σ_{k=1}^{t} (e_k + k − 1) 2^{e_k}. 

15. [20] Hand simulation of Algorithm L reveals that it occasionally does redundant 
operations; the assignments |L_s| ← p, |L_s| ← q in steps L4 and L6 are unnecessary 
about half of the time, since we have L_s = p (or q) each time step L4 (or L6) returns 
to L3. How can Program L be improved so that this redundancy disappears? 

16. [28] Design a list merging algorithm like Algorithm L but based on three-way 
merging. 

17. [20] (J. McCarthy.) Let the binary representation of N be as in exercise 14, and 
assume that we are given N records arranged in t ordered subfiles of respective sizes 
2^{e_1}, 2^{e_2}, ..., 2^{e_t}. Show how to maintain this state of affairs when a new (N + 1)st record 
is added and N ← N + 1. (The resulting algorithm may be called an online merge sort.) 





Fig. 31. A railway network with five “stacks.” 


18. [40] (M. A. Kronrod.) Given a file of N records containing only two runs, 

    K_1 ≤ ... ≤ K_M and K_{M+1} ≤ ... ≤ K_N, 

is it possible to sort the file with O(N) operations in a random-access memory, using 
only a small fixed amount of additional memory space regardless of the sizes of M 
and N? (All of the merging algorithms described in this section make use of extra 
memory space proportional to N.) 

19. [26] Consider a railway switching network with n "stacks," as shown in Fig. 31 
when n = 5; we considered one-stack networks in exercises 2.2.1-2 through 2.2.1-5. If 
N railroad cars enter at the right, we observed that only comparatively few of the N! 
permutations of those cars could appear at the left, in the one-stack case. 

In the n-stack network, assume that 2^n cars enter at the right. Prove that each 
of the 2^n! possible permutations of these cars is achievable at the left, by a suitable 
sequence of operations. (Each stack is actually much bigger than indicated in the 
illustration — big enough to accommodate all the cars, if necessary.) 

20. [47] In the notation of exercise 2.2.1-4, at most a_N^n permutations of N elements 
can be produced with an n-stack railway network; hence the number of stacks needed 
to obtain all N! permutations is at least log N!/log a_N ≈ log_4 N. Exercise 19 shows 
that at most ⌈lg N⌉ stacks are needed. What is the true rate of growth of the necessary 
number of stacks, as N → ∞? 

21. [23] (A. J. Smith.) Explain how to extend Algorithm L so that, in addition to 
sorting, it computes the number of inversions present in the input permutation. 

22. [28] (J. K. R. Barnett.) Develop a way to speed up merge sorting on multiword 
keys. (Exercise 5.2.2-30 considers the analogous problem for quicksort.) 

23. [M30] Exercises 13 and 14 analyze a "bottom-up" or iterative version of merge 
sort, where the cost c(N) of sorting N items satisfies the recurrence 

    c(N) = c(2^k) + c(N − 2^k) + f(2^k, N − 2^k)   for 2^k < N ≤ 2^{k+1}, 

and f(m, n) is the cost of merging m things with n. Study the "top-down" or divide- 
and-conquer recurrence 

    c(N) = c(⌈N/2⌉) + c(⌊N/2⌋) + f(⌈N/2⌉, ⌊N/2⌋)   for N > 1, 

which arises when merge sort is programmed recursively. 

5.2.5. Sorting by Distribution 

We come now to an interesting class of sorting methods that are essentially the 
exact opposite of merging, when considered from a standpoint we shall discuss 
in Section 5.4.7. These methods were used to sort punched cards for many years, 
long before electronic computers existed. The same approach can be adapted to 
computer programming, and it is generally known as “bucket sorting,” “radix 
sorting,” or “digital sorting,” because it is based on the digits of the keys. 

Suppose we want to sort a 52-card deck of playing cards. We may define 

    A < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 10 < J < Q < K 

as an ordering of the face values, and for the suits we may define 

    ♣ < ♦ < ♥ < ♠. 

One card is to precede another if either (i) its suit is less than the other suit, or 
(ii) its suit equals the other suit but its face value is less. (This is a particular 
case of lexicographic ordering between ordered pairs of objects; see exercise 5-2.) 
Thus 

    A♣ < 2♣ < ... < K♣ < A♦ < ... < Q♠ < K♠. 

We could sort the cards by any of the methods already discussed. Card 
players often use a technique somewhat analogous to the idea behind radix 
exchange: First they divide the cards into four piles, according to suit, then 
they fiddle with each individual pile until everything is in order. 

But there is a faster way to do the trick! First deal the cards face up into 
13 piles, one for each face value. Then collect these piles by putting the aces 
on the bottom, the 2s face up on top of them, then the 3s, etc., finally putting 
the kings (face up) on top. Turn the deck face down and deal again, this time 
into four piles for the four suits. (Again you turn the cards face up as you deal 
them.) By putting the resulting piles together, with clubs on the bottom, then 
diamonds, hearts, and spades, you’ll get the deck in perfect order. 

The same idea applies to the sorting of numbers and alphabetic data. Why 
does it work? Because (in our playing card example) if two cards go into different 
piles in the final deal, they have different suits, so the one with the lower suit is 
lowest. But if two cards have the same suit (and consequently go into the same 
pile), they are already in proper order because of the previous sorting. In other 
words, the face values will be in increasing order, on each of the four piles, as we 
deal the cards on the second pass. The same proof can be abstracted to show 
that any lexicographic ordering can be sorted in this way; for details, see the 
answer to exercise 5-2, at the beginning of this chapter. 

The sorting method just described is not immediately obvious, and it isn’t 
clear who first discovered the fact that it works so conveniently. A 19-page 
pamphlet entitled “The Inventory Simplified,” published by the Tabulating Ma- 
chines Company division of IBM in 1923, presented an interesting Digit Plan 
method for forming sums of products on their Electric Sorting Machine: Suppose, 
for example, that we want to multiply the number punched in columns 1-10 
by the number punched in columns 23-25, and to sum all of these products 
for a large number of cards. We can sort first on column 25, then use the 
Tabulating Machine to find the quantities a_1, a_2, ..., a_9, where a_k is the total 
of columns 1-10 summed over all cards having k in column 25. Then we can 
sort on column 24, finding the analogous totals b_1, b_2, ..., b_9; also on column 23, 
obtaining c_1, c_2, ..., c_9. The desired sum of products is easily seen to be 

    a_1 + 2a_2 + ... + 9a_9 + 10b_1 + 20b_2 + ... + 90b_9 + 100c_1 + 200c_2 + ... + 900c_9. 

This punched-card tabulating method leads naturally to the discovery of least- 
significant-digit-first radix sorting, so it probably became known to the machine 
operators. The first published reference to this principle for sorting appears in 
L. J. Comrie's early discussion of punched-card equipment [Transactions of the 
Office Machinery Users' Assoc., Ltd. (1929), 25-37, especially page 28]. 

In order to handle radix sorting inside a computer, we must decide what to 
do with the piles. Suppose that there are M piles; we could set aside M areas of 
memory, moving each record from an input area into its appropriate pile area. 
But this is unsatisfactory, since each area must be large enough to hold N items, 
and (M + 1)N record spaces would be required. Therefore most people rejected 
the idea of radix sorting within a computer, until H. H. Seward [Master's thesis, 
M.I.T. Digital Computer Laboratory Report R-232 (1954), 25-28] pointed out 
that we can achieve the same effect with only 2N record areas and M count fields. 
We simply count how many elements will lie in each of the M piles, by making 
a preliminary pass over the data; this tells us precisely how to allocate memory 
for the piles. We have already made use of the same idea in the "distribution 
counting sort," Algorithm 5.2D. 

Thus radix sorting can be carried out as follows: Start with a distribution 
sort based on the least significant digit of the keys (in radix M notation), moving 
records from the input area to an auxiliary area. Then do another distribution 
sort, on the next least significant digit, moving the records back into the original 
input area; and so on, until the final pass (on the most significant digit) puts all 
records into the desired order. 
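In present-day notation the procedure just described might be sketched as follows (a Python illustration, not from the text; key_digit is a hypothetical helper that extracts the kth least significant radix-M digit of a record's key):

    # LSD radix sort with counting, using two record areas as described.
    def radix_sort(records, key_digit, p, M):
        a, b = list(records), [None] * len(records)
        for k in range(1, p + 1):              # one pass per digit
            count = [0] * M
            for r in a:                        # counting pass
                count[key_digit(r, k)] += 1
            total = 0
            for i in range(M):                 # allocate storage for the piles
                count[i], total = total, total + count[i]
            for r in a:                        # move records, stably
                d = key_digit(r, k)
                b[count[d]] = r; count[d] += 1
            a, b = b, a                        # the two areas trade roles
        return a

    # Example: radix_sort(keys, lambda x, k: (x // 10**(k-1)) % 10, 3, 10)
    # sorts three-digit numbers as in Table 1 below.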

If we have a decimal computer with 12-digit keys, and if N is rather large, we 
can choose M = 1000 (considering three decimal digits as one radix-1000 digit); 
then sorting will be complete in four passes, regardless of the size of N. Similarly, 
if we have a binary computer and a 40-bit key, we can set M = 1024 = 2^{10} and 
complete the sorting in four passes. Actually each pass consists of three parts 
(counting, allocating, moving); E. H. Friend [JACM 3 (1956), 151] suggested 
combining two of those parts at the expense of M more memory locations, by 
accumulating the counts for pass k + 1 while moving the records on pass k. 

Table 1 shows how such a radix sort can be applied to our 16 example 
numbers, with M = 10. Radix sorting is generally not useful for such small N, 
so a small example like this is intended to illustrate the sufficiency rather than 
the efficiency of the method. 

An alert, “modern” reader will note, however, that the whole idea of mak- 
ing digit counts for the storage allocation is tied to old-fashioned ideas about 
sequential data representation. We know that linked allocation is specifically 
designed to handle a set of tables of variable size, so it is natural to choose a 
linked data structure for radix sorting. Since we traverse each pile serially, all 




Table 1 

RADIX SORTING 

Input area contents:                        503 087 512 061 908 170 897 275 653 426 154 509 612 677 765 703 
Counts for units digit distribution:          1   1   2   3   1   2   1   3   1   1 
Storage allocations based on these counts:    1   2   4   7   8  10  11  14  15  16 
Auxiliary area contents:                    170 061 512 612 503 653 703 154 275 765 426 087 897 677 908 509 
Counts for tens digit distribution:           4   2   1   0   0   2   2   3   1   1 
Storage allocations based on these counts:    4   6   7   7   7   9  11  14  15  16 
Input area contents:                        503 703 908 509 512 612 426 653 154 061 765 170 275 677 087 897 
Counts for hundreds digit distribution:       2   2   1   0   1   3   3   2   1   1 
Storage allocations based on these counts:    2   4   5   5   6   9  12  14  15  16 
Auxiliary area contents:                    061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908 


we need is a single link from each item to its successor. Furthermore, we never 
need to move the records; we merely adjust the links and proceed merrily down 
the lists. The amount of memory required is (1 + ε)N + 2εM records, where ε 
is the amount of space taken up by a link field. Formal details of this procedure 
are rather interesting since they furnish an excellent example of typical data 
structure manipulations, combining sequential and linked allocation: 

Algorithm R (Radix list sort). Records R_1, ..., R_N are each assumed to contain 
a LINK field. Their keys are assumed to be p-tuples 

    (a_1, a_2, ..., a_p),    0 ≤ a_i < M,    (1) 

where the order is defined lexicographically so that 

    (a_1, a_2, ..., a_p) < (b_1, b_2, ..., b_p)    (2) 

if and only if for some j, 1 ≤ j ≤ p, we have 

    a_i = b_i for all i < j,  but  a_j < b_j.    (3) 

The keys may, in particular, be thought of as numbers written in radix M 
notation, 

    a_1 M^{p−1} + a_2 M^{p−2} + ... + a_{p−1} M + a_p,    (4) 

and in this case lexicographic order corresponds to the normal ordering of non- 
negative numbers. The keys may also be strings of alphabetic letters, etc. 

Sorting is done by keeping M "piles" of records, in a manner that exactly 
parallels the action of a card sorting machine. The piles are really queues in the 
sense of Chapter 2, since we link them together so that they are traversed in a 
first-in-first-out manner. There are two pointer variables TOP[i] and BOTM[i] 
for each pile, 0 ≤ i < M, and we assume as in Chapter 2 that 

    LINK(LOC(BOTM[i])) ≡ BOTM[i].    (5) 





Fig. 32. Radix list sort. 


R1. [Loop on k.] In the beginning, set P ← LOC(R_N), a pointer to the last 
record. Then perform steps R2 through R6 for k = 1, 2, ..., p. (Steps R2 
through R6 constitute one "pass.") Then the algorithm terminates, with P 
pointing to the record with the smallest key, LINK(P) to the record with next 
smallest, then LINK(LINK(P)), etc.; the LINK in the final record will be Λ. 

R2. [Set piles empty.] Set TOP[i] ← LOC(BOTM[i]) and BOTM[i] ← Λ, for 
0 ≤ i < M. 

R3. [Extract kth digit of key.] Let KEY(P), the key in the record referenced by P, 
be (a_1, a_2, ..., a_p); set i ← a_{p+1−k}, the kth least significant digit of this key. 

R4. [Adjust links.] Set LINK(TOP[i]) ← P, then set TOP[i] ← P. 

R5. [Step to next record.] If k = 1 (the first pass) and if P = LOC(R_j), for some 
j ≠ 1, set P ← LOC(R_{j−1}) and return to R3. If k > 1 (subsequent passes), 
set P ← LINK(P), and return to R3 if P ≠ Λ. 

R6. [Do Algorithm H.] (We are now done distributing all elements onto the 
piles.) Perform Algorithm H below, which "hooks together" the individual 
piles into one list, in preparation for the next pass. Then set P ← BOTM[0], 
a pointer to the first element of the hooked-up list. (See exercise 3.) | 

Algorithm H (Hooking-up of queues). Given M queues, linked according to 
the conventions of Algorithm R, this algorithm adjusts at most M links so that 
a single queue is created, with BOTM[0] pointing to the first element, and with 
pile 0 preceding pile 1 ... preceding pile M − 1. 

H1. [Initialize.] Set i ← 0. 

H2. [Point to top of pile.] Set P ← TOP[i]. 

H3. [Next pile.] Increase i by 1. If i = M, set LINK(P) ← Λ and terminate the 
algorithm. 

H4. [Is pile empty?] If BOTM[i] = Λ, go back to H3. 

H5. [Tie piles together.] Set LINK(P) ← BOTM[i]. Return to H2. | 
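Using Python queues for the piles, Algorithms R and H together reduce to a few lines (a sketch under the conventions of (1); a deque plays the role of the TOP/BOTM pointer pair, and the hooking-up of Algorithm H becomes a simple concatenation):

    from collections import deque

    # Radix list sort of p-tuple keys (a_1, ..., a_p) with 0 <= a_i < M.
    def radix_list_sort(keys, p, M):
        items = list(keys)
        for k in range(1, p + 1):                  # R1: loop on k
            piles = [deque() for _ in range(M)]    # R2: set piles empty
            for key in items:                      # R3-R5: distribute
                piles[key[p - k]].append(key)      # kth least significant digit
            items = [key for pile in piles for key in pile]   # R6 and H
        return items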









Figure 33 shows the contents of the piles after each of the three passes, when 
our 16 example numbers are sorted with M = 10. Algorithm R is very easy to 
program for MIX, once a suitable way to treat the pass-by-pass variation of steps 
R3 and R5 has been found. The following program does this without sacrificing 
any speed in the inner loop, by overlaying two of the instructions. Note that 
TOP[i] and BOTM[i] can be packed into the same word. 




Fig. 33. Radix sort using linked allocation: contents of the ten piles after each pass. 

Program R (Radix list sort). The given records in locations INPUT+1 through 
INPUT+N are assumed to have p = 3 components (a_1, a_2, a_3) stored respectively 
in the (1:1), (2:2), and (3:3) fields. (Thus M is assumed to be less than or 
equal to the byte size of MIX.) The (4:5) field of each record is its LINK. We 
let TOP[i] ≡ PILES + i (1:2) and BOTM[i] ≡ PILES + i (4:5), for 0 ≤ i < M. It 
is convenient to make links relative to location INPUT, so that LOC(BOTM[i]) = 
PILES + i − INPUT; to avoid negative links we therefore want the PILES table to be 
in higher locations than the INPUT table. Index registers are assigned as follows: 
rI1 ≡ P, rI2 ≡ i, rI3 ≡ 3 − k, rI4 ≡ TOP[i]; during Algorithm H, rI2 ≡ i − M. 


01  LINK   EQU   4:5
02  TOP    EQU   1:2
03  START  ENT1  N                 1       R1. Loop on k. P ← LOC(R_N).
04         ENT3  2                 1       k ← 1.
05  2H     ENT2  M-1               3       R2. Set piles empty.
06         ENTA  PILES-INPUT,2     3M      LOC(BOTM[i])
07         STA   PILES,2(TOP)      3M        → TOP[i].
08         STZ   PILES,2(LINK)     3M      BOTM[i] ← Λ.
09         DEC2  1                 3M
10         J2NN  *-4               3M      M > i ≥ 0.
11         LDA   R3SW,3            3
12         STA   3F                3       Modify instructions for pass k.
13         LDA   R5SW,3            3
14         STA   5F                3
15  3H     [LD2  INPUT,1(3:3)]             R3. Extract kth digit of key.
16  4H     LD4   PILES,2(TOP)      3N      R4. Adjust links.
17         ST1   INPUT,4(LINK)     3N      LINK(TOP[i]) ← P.
18         ST1   PILES,2(TOP)      3N      TOP[i] ← P.
19  5H     [DEC1 1]                        R5. Step to next record.
20         J1NZ  3B                3N      To R3 if not end of pass.
21  6H     ENN2  M                 3       R6. Do Algorithm H.
22         JMP   7F                3       To H2 with i ← 0.
23  R3SW   LD2   INPUT,1(1:1)      N       Instruction for R3 when k = 3.
24         LD2   INPUT,1(2:2)      N       Instruction for R3 when k = 2.
25         LD2   INPUT,1(3:3)      N       Instruction for R3 when k = 1.
26  R5SW   LD1   INPUT,1(LINK)     N       Instruction for R5 when k = 3.
27         LD1   INPUT,1(LINK)     N       Instruction for R5 when k = 2.
28         DEC1  1                 N       Instruction for R5 when k = 1.
29  9H     LDA   PILES+M,2(LINK)   3M-3    H4. Is pile empty?
30         JAZ   8F                3M-3    To H3 if BOTM[i] = Λ.
31         STA   INPUT,1(LINK)     3M-3-E  H5. Tie piles together.
32  7H     LD1   PILES+M,2(TOP)    3M-E    H2. Point to top of pile.
33  8H     INC2  1                 3M      H3. Next pile. i ← i + 1.
34         J2NZ  9B                3M      To H4 if i ≠ M.
35         STZ   INPUT,1(LINK)     3       LINK(P) ← Λ.
36         LD1   PILES(LINK)       3       P ← BOTM[0].
37         DEC3  1                 3
38         J3NN  2B                3       Loop for 1 ≤ k ≤ 3.  |


The running time of Program R is 32N + 48M + 38 − 4E, where N is the
number of input records, M is the radix (the number of piles), and E is the
number of occurrences of empty piles. This compares very favorably with other
programs we have constructed based on similar assumptions (Programs 5.2.1M,
5.2.4L). A p-pass version of the program would take (11p − 1)N + O(pM) units
of time; the critical factor in the timing is the inner loop, which involves five
references to memory and one branch. On a typical computer we will have
M = b^r and p = ⌈t/r⌉, where t is the number of radix-b digits in the keys;




increasing r will decrease p, so the formulas can be used to determine a best 
value of r. 
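The trade-off can be tabulated mechanically. The throwaway Python fragment below is illustrative only: the (11p − 1)N term comes from the estimate above, while 48pM is merely a stand-in for the unspecified O(pM) overhead, on the model of Program R's 48M term.

    from math import ceil

    def est_cost(r, b=2, t=30, N=100000):
        """Estimated p-pass radix-sort time (11p - 1)N + O(pM),
        with 48pM standing in for the O(pM) term (an assumption)."""
        M, p = b**r, ceil(t / r)
        return (11*p - 1)*N + 48*p*M

    best_r = min(range(1, 31), key=est_cost)
    print(best_r, est_cost(best_r))     # the minimizing digit size r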

The only variable in the timing is E, the number of empty piles observed
in step H4. If we consider each of the M^N sequences of radix-M digits to be
equally probable, we know from our study of the "poker test" in Section 3.3.2D
that there are M − r empty piles with probability

    \frac{M(M-1)\cdots(M-r+1)}{M^N} {N \brace r}    (6)

on each pass, where {N \brace r} is a Stirling number of the second kind. By exercise 6,

    E = (min max(M − N, 0)p,  ave M(1 − 1/M)^N p,  max (M − 1)p).    (7)

An ever-increasing number of "pipeline" or "number-crunching" computers
have appeared in recent years. These machines have multiple arithmetic units
and look-ahead circuitry so that memory references and computation can be
highly overlapped; but their efficiency deteriorates noticeably in the presence of
conditional branch instructions unless the branch almost always goes the same
way. The inner loop of a radix sort is well adapted to such machines, because
it is a straight iterative calculation of typical number-crunching form. Therefore
radix sorting is usually more efficient than any other known method for internal
sorting on such machines, provided that N is not too small and the keys are not
too long.

Of course, radix sorting is not very efficient when the keys are extremely
long. For example, imagine sorting 60-digit decimal numbers with 20 passes of a
radix sort, using M = 10^3; very few pairs of numbers will tend to have identical
keys in their leading 9 digits, so the first 17 passes accomplish very little. In our
analysis of radix exchange sorting, we found that it was unnecessary to inspect
many bits of the key, when we looked at the keys from the left instead of the
right. Let us therefore reconsider the idea of a radix sort that starts at the most
significant digit (MSD) instead of the least significant digit (LSD).

We have already remarked that an MSD-first radix method suggests itself 
naturally; in fact, it is not hard to see why the post office uses such a method 
to sort mail. A large collection of letters can be sorted into separate bags for 
different geographical areas; each of these bags then contains a smaller number 
of letters that can be sorted independently of the other bags, into finer and 
finer geographical divisions. (Indeed, bags of letters can be transported nearer 
to their destinations before they are sorted further, or as they are being sorted 
further.) This principle of “divide and conquer” is quite appealing, and the 
only reason it doesn’t work especially well for sorting punched cards is that it 
ultimately spends too much time fussing with very small piles. Algorithm R is 
relatively efficient, even though it considers LSD first, since we never have more 
than M piles, and the piles need to be hooked together only p times. On the 
other hand, it is not difficult to design an MSD-first radix method using linked 
memory, with negative links as in Algorithm 5.2.4L to denote the boundaries 




between piles. (See exercise 10.) The main difficulty is that empty piles tend to 
proliferate and to consume a great deal of time in an MSD-first method. 

Perhaps the best compromise has been suggested by M. D. MacLaren [JACM
13 (1966), 404-411], who recommends an LSD-first sort as in Algorithm R, but
applied only to the most significant digits. This does not completely sort the file,
but it usually brings the file very nearly into order, so that very few inversions
remain; therefore straight insertion can be used to finish up. Our analysis of
Program 5.2.1M applies also to this situation, so that if the keys are uniformly
distributed we will have an average of (1/4)N(N − 1)M^{−p} inversions remaining in
the file after sorting on the leading p digits. (See Eq. 5.2.1-(17) and exercise
5.2.1-38.) MacLaren has computed the average number of memory references
per item sorted, and the optimum choice of M and p (assuming that M is
a power of 2, that the keys are uniformly distributed, and that N/M^p ≤ 0.1
so that deviations from uniformity are tolerable) turns out to be given by the
following table:

         N = 100    1000   10000   100000   1000000   10^7   10^8   10^9
    best M = 32     128    512     1024     8192      2^15   2^17   2^19
    best p = 2      2      2       2        2         2      2      2
      β(N) = 19.3   18.5   18.2    18.1     18.0      18.0   18.0   18.0


Here β(N) denotes the average number of memory references per item sorted,

    β(N) = 5p + 8 + \frac{2pM}{N} + \frac{N-1}{2M^p} + \frac{H_N}{N};    (8)


it is bounded as N → ∞, if we take p = 2 and M > √N, so the average sorting
time is actually O(N) instead of order N log N. This method is an improvement
over multiple list insertion (Program 5.2.1M), which is essentially the case p = 1.
Exercise 12 gives MacLaren's interesting procedure for final rearrangement of a
partially list-sorted file.
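MacLaren's compromise is easy to express in a few lines. The following Python sketch is an illustration of the scheme just described (not MacLaren's own code); the parameters M = 1024 and p = 2 are taken from the table above, and keys are assumed uniform in [0..1).

    import random

    def maclaren_sort(keys, M=1024, p=2):
        """Stable LSD distribution on the p leading radix-M digits, then
        straight insertion to remove the roughly N(N-1)/(4 M**p)
        inversions that remain on the average (a sketch)."""
        seq = list(keys)
        for k in range(p, 0, -1):                 # digits p, p-1, ..., 1
            piles = [[] for _ in range(M)]
            for x in seq:
                piles[int(x * M**k) % M].append(x)    # kth leading digit
            seq = [x for pile in piles for x in pile]
        for i in range(1, len(seq)):              # straight insertion
            x, j = seq[i], i - 1
            while j >= 0 and seq[j] > x:
                seq[j + 1] = seq[j]; j -= 1
            seq[j + 1] = x
        return seq

    data = [random.random() for _ in range(100000)]
    assert maclaren_sort(data) == sorted(data)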

It is also possible to avoid the link fields, using the methods of Algorithm
5.2D and exercise 5.2-13, so that only O(√N) memory locations are
needed in addition to the space required for the records themselves. The average
sorting time is proportional to N if the input records are uniformly distributed.

W. Dobosiewicz obtained good results by using an MSD-first distribution
sort until reaching short subfiles, with the distribution process constrained so
that the first M/2 piles were guaranteed to receive between 25% and 75% of the
records [see Inf. Proc. Letters 7 (1978), 1-6; 8 (1979), 170-172]; this ensured
that the average time to sort uniform keys would be O(N) while the worst case
would be O(N log N). His papers inspired several other researchers to devise
new address calculation algorithms, of which the most instructive is perhaps the
following 2-level scheme due to Markku Tamminen [J. Algorithms 6 (1985), 138-
144]: Assume that all keys are fractions in the interval [0..1). First distribute
the N records into ⌊N/8⌋ bins by mapping key K into bin ⌊KN/8⌋. Then suppose
bin k has received N_k records; if N_k < 16, sort it by straight insertion, otherwise




sort it by a MacLaren-like distribution-plus-insertion sort into M_2 bins, where
M_2 ≈ 10N_k. Tamminen proved the following remarkable result:

Theorem T. There is a constant T such that the sorting method just described
performs at most TN operations on the average, whenever the keys
are independent random numbers whose density function f(x) is bounded and
Riemann-integrable for 0 ≤ x < 1. (The constant T does not depend on f.)

Proof. See exercise 18. Intuitively, the first distribution into N/8 piles finds
intervals in which f is approximately constant; the second distribution will then
make the expected bin size approximately constant. |
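A sketch of the two-level scheme in Python appears below; the constants 8, 16, and 10·N_k are the ones given in the text, while everything else (names, the triangular test density) is illustrative only.

    import random

    def insertion(a):
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]; j -= 1
            a[j + 1] = x
        return a

    def tamminen_sort(keys):
        """Two-level distribution sort (a sketch); keys lie in [0, 1)."""
        N = len(keys)
        n1 = max(N // 8, 1)                   # first level: floor(N/8) bins
        bins = [[] for _ in range(n1)]
        for x in keys:
            bins[min(int(x * n1), n1 - 1)].append(x)
        out = []
        for k, b in enumerate(bins):
            if len(b) < 16:                   # small bin: straight insertion
                out += insertion(b)
            else:                             # second level: ~10*Nk subbins
                m2, lo = 10 * len(b), k / n1  # bin k covers [lo, lo + 1/n1)
                sub = [[] for _ in range(m2)]
                for x in b:
                    sub[min(int((x - lo) * n1 * m2), m2 - 1)].append(x)
                for s in sub:
                    out += insertion(s)
        return out

    data = [(random.random() + random.random()) / 2 for _ in range(20000)]
    assert tamminen_sort(data) == sorted(data)   # a bounded, nonuniform density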

Several versions of radix sort that have been well tuned for sorting large
arrays of alphabetic strings are described in an instructive article by P. M.
McIlroy, K. Bostic, and M. D. McIlroy, Computing Systems 6 (1993), 5-27.

EXERCISES 

► 1. [20] The algorithm of exercise 5.2-13 shows how to do a distribution sort with
only N record areas (and M count fields), instead of 2N record areas. Does this lead
to an improvement over the radix sorting algorithm illustrated in Table 1?

2. [13] Is Algorithm R a stable sorting method? 

3. [15] Explain why Algorithm H makes BOTM[0] point to the first record in the
"hooked-up" queue, even though pile 0 might be empty.

► 4. [23] Algorithm R keeps the M piles linked together as queues (first-in-first-out). 
Explore the idea of linking the piles as stacks instead. (The arrows in Fig. 33 would 
go downward instead of upward, and the BOTM table would be unnecessary.) Show that 
if the piles are “hooked together” in an appropriate order, it is possible to achieve a 
valid sorting method. Does this lead to a simpler or a faster algorithm? 

5. [20] What changes are necessary to Program R so that it sorts eight-byte keys
instead of three-byte keys? Assume that the most significant bytes of K_i are stored in
location KEY+i (1:5), while the three least significant bytes are in location INPUT+i (1:3)
as at present. What is the running time of the program, after these changes have been
made?

6. [M24] Let g_{MN}(z) = Σ_k p_{MNk} z^k, where p_{MNk} is the probability that exactly k
empty piles are present after a random radix-sort pass puts N elements into M piles.

a) Show that g_{M(N+1)}(z) = g_{MN}(z) + ((1 − z)/M) g′_{MN}(z).

b) Use this relation to find simple expressions for the mean and variance of this
probability distribution, as a function of M and N.

7. [20] Discuss the similarities and differences between Algorithm R and radix ex- 
change sorting (Algorithm 5.2.2R). 

► 8. [20] The radix-sorting algorithms discussed in the text assume that all keys being 
sorted are nonnegative. What changes should be made to the algorithms when the keys 
are numbers expressed in two’s complement or ones’ complement notation? 

9. [20] Continuing exercise 8, what changes should be made to the algorithms when 
the keys are numbers expressed in signed magnitude notation? 




10 . [30] Design an efficient most-significant-digit-first radix-sorting algorithm that 
uses linked memory. (As the size of the subfiles decreases, it is wise to decrease M, and 
to use a nonradix method on the really short subfiles.) 

11 . [16] The sixteen input numbers shown in Table 1 start with 41 inversions; after 
sorting is complete, of course, there are no inversions remaining. How many inversions 
would be present in the file if we omitted pass 1, doing a radix sort only on the tens 
and hundreds digits? How many inversions would be present if we omitted both pass 1 
and pass 2? 

12. [24] (M. D. MacLaren.) Suppose that Algorithm R has been applied only to the
p leading digits of the actual keys; thus the file is nearly sorted when we read it in
the order of the links, but keys that agree in their first p digits may be out of order.
Design an algorithm that rearranges the records in place so that their keys are in order,
K_1 ≤ K_2 ≤ ⋯ ≤ K_N. [Hint: The special case that the file is perfectly sorted appears
in the answer to exercise 5.2-12; it is possible to combine this with straight insertion
without loss of efficiency, since few inversions remain in the file.]

13. [40] Implement the internal sorting method suggested in the text at the close of
this section, producing a subroutine that sorts random data in O(N) units of time with
only O(√N) additional memory locations.

14. [22] The sequence of playing cards

    [illustration of the original thirteen-card sequence]

can be sorted into increasing order A 2 ... 10 J Q K from top to bottom in two passes,
using just two piles for intermediate storage: Deal the cards face down into two piles
containing respectively A 2 9 3 10 and 4 J 5 6 Q K 7 8 (from bottom to top); then put
the second pile on the first, turn the deck face up, and deal into two piles A 2 3 4 5 6 7 8,
9 10 J Q K. Combine these piles, turn them face up, and you're done.

Prove that this sequence of cards cannot be sorted into decreasing order K Q J ... 2 A
from top to bottom in two passes, even if you are allowed to use up to three piles for
intermediate storage. (Dealing must always be from the top of the deck, turning the
cards face down as they are dealt. Top to bottom is right to left in the illustration.)

15. [M25] Consider the problem of exercise 14 when all cards must be dealt face up 
instead of face down. Thus, one pass can be used to convert increasing order into 
decreasing order. How many passes are required? 

► 16. [25] Design an algorithm to sort strings a_1, ..., a_n on an m-letter alphabet into
lexicographic order. The total running time of your algorithm should be O(m + n + N),
where N = |a_1| + ⋯ + |a_n| is the total length of all the strings.





17. [15] In the two-level distribution sort proposed by Tamminen (see Theorem T),
why is a MacLaren-like method used for the second level of distribution but not the
first level?

18. [HM26] Prove Theorem T. Hint: Show first that MacLaren's distribution-plus-
insertion algorithm does O(BN) operations, on the average, when it is applied to
independent random keys whose probability density function satisfies f(x) ≤ B for
0 ≤ x < 1.


For sorting the roots and words 
we had the use of 1100 lozenge boxes, 
and used trays for the forms. 
— GEORGE V. WIGRAM (1843) 




5.3. OPTIMUM SORTING 

Now that we have analyzed a great many methods for internal sorting, it is
time to turn to a broader question: What is the best possible way to sort? Can
we place limits on the maximum sorting speeds that will ever be achievable, no
matter how clever a programmer might be?

Of course there is no best possible way to sort; we must define precisely 
what is meant by “best,” and there is no best possible way to define “best.” 
We have discussed similar questions about the theoretical optimality of algo- 
rithms in Sections 4.3.3, 4.6.3, and 4.6.4, where high-precision multiplication 
and polynomial evaluation were considered. In each case it was necessary to 
formulate a rather simple definition of a “best possible” algorithm, in order to 
give sufficient structure to the problem to make it workable. And in each case 
we ran into interesting problems that are so difficult they still haven’t been 
completely resolved. The same situation holds for sorting; some very interesting 
discoveries have been made, but many fascinating questions remain unanswered. 

Studies of the inherent complexity of sorting have usually been directed 
towards minimizing the number of times we make comparisons between keys 
while sorting n items, or merging m items with n, or selecting the tth largest of an
unordered set of n items. Sections 5.3.1, 5.3.2, and 5.3.3 discuss these questions 
in general, and Section 5.3.4 deals with similar issues under the interesting 
restriction that the pattern of comparisons must essentially be fixed in advance. 
Several other types of interesting theoretical questions related to optimum sorting 
appear in the exercises for Section 5.3.4, and in the discussion of external sorting 
(Sections 5.4.4, 5.4.8, and 5.4.9). 

As soon as an Analytical Engine exists,
it will necessarily guide the future course of the science. 

Whenever any result is sought by its aid, 
the question will then arise — 
By what course of calculation can these 
results be arrived at by the machine 
in the shortest time? 

— CHARLES BABBAGE (1864) 


5.3.1. Minimum-Comparison Sorting 

The minimum number of key comparisons needed to sort n elements is obviously 
zero, because we have seen radix methods that do no comparisons at all. In fact,
it is possible to write MIX programs that are able to sort, although they contain 
no conditional jump instructions at all! (See exercise 5-8 at the beginning of this 
chapter.) We have also seen several sorting methods that are based essentially 
on comparisons of keys, yet their running time in practice is dominated by other 
considerations such as data movement, housekeeping operations, etc. 

Therefore it is clear that comparison counting is not the only way to measure 
the effectiveness of a sorting method. But it is fun to scrutinize the number of 
comparisons anyway, since a theoretical study of this subject gives us a good 




[Comparison-tree diagram, with internal nodes drawn on levels 0 through 3.]

Fig. 34. A comparison tree for sorting three elements. 

deal of useful insight into the nature of sorting processes, and it also helps us to 
sharpen our wits for the more mundane problems that confront us at other times. 

In order to rule out radix-sorting methods, which do no comparisons at
all, we shall restrict our discussion to sorting techniques that are based solely
on an abstract linear ordering relation "<" between keys, as discussed at the
beginning of this chapter. For simplicity, we shall also confine our discussion to
the case of distinct keys, so that there are only two possible outcomes of any
comparison of K_i versus K_j: either K_i < K_j or K_i > K_j. (For an extension
of the theory to the general case where equal keys are allowed, see exercises 3
through 12. For bounds on the worst-case running time that is needed to sort
integers without the restriction to comparison-based methods, see Fredman and
Willard, J. Computer and Syst. Sci. 47 (1993), 424-436; Ben-Amram and Galil,
J. Comp. Syst. Sci. 54 (1997), 345-370; Thorup, SODA 9 (1998), 550-555.)

The problem of sorting by comparisons can also be expressed in other 
equivalent ways. Given a set of n distinct weights and a balance scale, we can 
ask for the least number of weighings necessary to completely rank the weights in 
order of magnitude, when the pans of the balance scale can each accommodate 
only one weight. Alternatively, given a set of n players in a tournament, we 
can ask for the smallest number of games that suffice to rank all contestants, 
assuming that the strengths of the players can be linearly ordered (with no ties). 

All n-element sorting methods that satisfy the constraints above can be
represented in terms of an extended binary tree structure such as that shown
in Fig. 34. Each internal node (drawn as a circle) contains two indices "i:j"
denoting a comparison of K_i versus K_j. The left subtree of this node represents
the subsequent comparisons to be made if K_i < K_j, and the right subtree
represents the actions to be taken when K_i > K_j. Each external node of the tree
(drawn as a box) contains a permutation a_1 a_2 ... a_n of {1, 2, ..., n}, denoting
the fact that the ordering

    K_{a_1} < K_{a_2} < ⋯ < K_{a_n}

has been established. (If we look at the path from the root to this external node,
each of the n − 1 relationships K_{a_i} < K_{a_{i+1}} for 1 ≤ i < n will be the result of
some comparison a_i : a_{i+1} or a_{i+1} : a_i on this path.)










Fig. 35. Example of a redundant comparison. 

Thus Fig. 34 represents a sorting method that first compares K_1 with K_2;
if K_1 > K_2, it goes on (via the right subtree) to compare K_2 with K_3, and
then if K_2 < K_3 it compares K_1 with K_3; finally if K_1 > K_3 it knows that
K_2 < K_3 < K_1. An actual sorting algorithm will usually also move the keys
around in the file, but we are interested here only in the comparisons, so we
ignore all data movement. A comparison of K_i with K_j in this tree always
means the original keys K_i and K_j, not the keys that might currently occupy
the ith and jth positions of the file after the records have been shuffled around.

It is possible to make redundant comparisons; for example, in Fig. 35 there
is no reason to compare 3:1, since K_1 < K_2 and K_2 < K_3 implies that K_1 < K_3.
No permutation can possibly correspond to the left subtree of node 3:1 in Fig. 35;
consequently that part of the algorithm will never be performed! Since we are
interested in minimizing the number of comparisons, we may assume that no
redundant comparisons are made. Hence we have an extended binary tree structure
in which every external node corresponds to a permutation. All permutations of
the input keys are possible, and every permutation defines a unique path from
the root to an external node; it follows that there are exactly n! external nodes
in a comparison tree that sorts n elements with no redundant comparisons.
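The tree of Fig. 34 can be transcribed directly into nested comparisons; the following Python fragment (illustrative only) verifies that each of the 3! = 6 permutations of distinct keys reaches an external node labeled with the correct ordering.

    from itertools import permutations

    def sort3(K):
        """The comparison tree of Fig. 34 (indices 1-based as in the text):
        returns a1 a2 a3 with K[a1] < K[a2] < K[a3]."""
        if K[1] < K[2]:
            if K[2] < K[3]:   return (1, 2, 3)
            elif K[1] < K[3]: return (1, 3, 2)
            else:             return (3, 1, 2)
        else:
            if K[2] < K[3]:
                if K[1] < K[3]: return (2, 1, 3)
                else:           return (2, 3, 1)
            else:               return (3, 2, 1)

    # each permutation reaches its own external node:
    for p in permutations([10, 20, 30]):
        K = {1: p[0], 2: p[1], 3: p[2]}
        a1, a2, a3 = sort3(K)
        assert K[a1] < K[a2] < K[a3]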

The best worst case. The first problem that arises naturally is to find 
comparison trees that minimize the maximum number of comparisons made. 
(Later we shall consider the average number of comparisons.) 

Let S(n) be the minimum number of comparisons that will suffice to sort
n elements. If all the internal nodes of a comparison tree are at levels < k, it is
obvious that there can be at most 2^k external nodes in the tree. Hence, letting
k = S(n), we have

    n! ≤ 2^{S(n)}.

Since S(n) is an integer, we can rewrite this formula to obtain the lower bound

    S(n) ≥ ⌈lg n!⌉.    (1)

Stirling's approximation tells us that

    ⌈lg n!⌉ = n lg n − n/ln 2 + (1/2) lg n + O(1),    (2)

hence roughly n lg n comparisons are needed.




Relation (1) is often called the information-theoretic lower bound, since
cognoscenti of information theory would say that lg n! "bits of information" are
being acquired during a sorting process; each comparison yields at most one bit of
information. Trees such as Fig. 34 have also been called "questionnaires"; their
mathematical properties were first explored systematically in Claude Picard's
book Théorie des Questionnaires (Paris: Gauthier-Villars, 1965).

Of all the sorting methods we have seen, the three that require fewest
comparisons are binary insertion (see Section 5.2.1), tree selection (see Section 5.2.3),
and straight two-way merging (see Algorithm 5.2.4L). The maximum number of
comparisons for binary insertion is readily seen to be

    B(n) = Σ_{k=1}^{n} ⌈lg k⌉ = n⌈lg n⌉ − 2^{⌈lg n⌉} + 1    (3)

by exercise 1.2.4-42, and the maximum number of comparisons in two-way
merging is given in exercise 5.2.4-14. We will see in Section 5.3.3 that tree
selection has the same bound on its comparisons as either binary insertion or
two-way merging, depending on how the tree is set up. In all three cases we
achieve an asymptotic value of n lg n; combining these lower and upper bounds
for S(n) proves that
for S(n) proves that 


lim 


s ( n ) 

nlgn 


= 1 . 


(4) 


Thus we have an approximate formula for S(n), but it is desirable to obtain
more precise information. The following table gives exact values of the lower
and upper bounds discussed above, for small n:

          n = 1  2  3  4  5   6   7   8   9  10  11  12  13  14  15  16  17
    ⌈lg n!⌉ = 0  1  3  5  7  10  13  16  19  22  26  29  33  37  41  45  49
       B(n) = 0  1  3  5  8  11  14  17  21  25  29  33  37  41  45  49  54
       L(n) = 0  1  3  5  9  11  14  17  25  27  30  33  38  41  45  49  65

Here B(n) and L(n) refer respectively to binary insertion and two-way list
merging. It can be shown that B(n) ≤ L(n) for all n (see exercise 2).
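Both numeric rows of this table can be reproduced mechanically; the short Python check below (illustrative only) uses the fact that ⌈lg x⌉ = (x − 1).bit_length() for integers x ≥ 1, so no floating point is involved.

    from math import factorial

    ceil_lg = lambda x: (x - 1).bit_length()    # ceiling of lg x, for x >= 1

    lgfact = [ceil_lg(factorial(n)) for n in range(1, 18)]
    B = [sum(ceil_lg(k) for k in range(1, n + 1)) for n in range(1, 18)]
    assert lgfact == [0, 1, 3, 5, 7, 10, 13, 16, 19, 22, 26, 29, 33, 37, 41, 45, 49]
    assert B == [0, 1, 3, 5, 8, 11, 14, 17, 21, 25, 29, 33, 37, 41, 45, 49, 54]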

From the table above, we can see that S(4) = 5, but S(5) might be either
7 or 8. This brings us back to a problem stated at the beginning of Section 5.2:
What is the best way to sort five elements? Can five elements be sorted using
only seven comparisons?

The answer is yes, but a seven-step procedure is not especially easy to
discover. We begin by first comparing K_1 : K_2, then K_3 : K_4, then the larger
elements of these pairs. This produces a configuration that may be diagrammed

    a → b → d,   c → d,   e    (5)
to indicate that a < b < d and c < d. (It is convenient to represent known 
ordering relations between elements by drawing directed graphs such as this, 




where x is known to be less than y if and only if there is a path from x to y in
the graph.) At this point we insert the fifth element K_5 = e into its proper place
among {a, b, d}; only two comparisons are needed, since we may compare it first
with b and then with a or d. This leaves one of four possibilities,

b d e b d bed b d e 

_// / / /7~~ <•> 

e a c a c a c a c 

and in each case we can insert c among the remaining elements less than d in
one or two more comparisons. This method for sorting five elements was first
found by H. B. Demuth [Ph.D. thesis, Stanford University (1956), 41-43].

Merge insertion. A pleasant generalization of the method above has been
discovered by Lester Ford, Jr. and Selmer Johnson. Since it involves some aspects
of merging and some aspects of insertion, we shall call it merge insertion. For
example, consider the problem of sorting 21 elements. We start by comparing
the ten pairs K_1 : K_2, K_3 : K_4, ..., K_19 : K_20; then we sort the ten larger elements
of the pairs, using merge insertion. As a result we obtain the configuration

    [diagram (7): the chain a_1 → a_2 → ⋯ → a_10, with b_i → a_i for
    1 ≤ i ≤ 10 and with b_11 unattached]

analogous to (5). The next step is to insert b_3 among {b_1, a_1, a_2}, then b_2 among
the other elements less than a_2; we arrive at the configuration

    [diagram (8): the chain c_1 → c_2 → c_3 → c_4 → c_5 → c_6 → a_4 → ⋯ → a_10,
    with b_i → a_i for 4 ≤ i ≤ 10 and with b_11 unattached]

Let us call the upper-line elements the main chain. We can insert b_5 into its
proper place in the main chain, using three comparisons (first comparing it to
c_4, then to c_2 or c_6, etc.); then b_4 can be moved into the main chain in three more
steps, leading to


    [diagram (9): the chain d_1 → d_2 → ⋯ → d_10 → a_6 → a_7 → ⋯ → a_10,
    with b_i → a_i for 6 ≤ i ≤ 10 and with b_11 unattached]


The next step is crucial; is it clear what to do? We insert b_11 (not b_7) into the
main chain, using only four comparisons. Then b_10, b_9, b_8, b_7, b_6 (in this order)
can also be inserted into their proper places in the main chain, using at most
four comparisons each.

A careful count of the comparisons involved here shows that the 21 elements
have been sorted in at most 10 + S(10) + 2 + 2 + 3 + 3 + 4 + 4 + 4 + 4 + 4 + 4 = 66
steps. Since

    2^65 < 21! < 2^66,




we also know that no fewer than 66 would be needed in any event; hence

    S(21) = 66.    (10)

(Binary insertion would have required 74 comparisons.)

In general, merge insertion proceeds as follows for n elements:

i) Make pairwise comparisons of ⌊n/2⌋ disjoint pairs of elements. (If n is odd,
leave one element out.)

ii) Sort the ⌊n/2⌋ larger numbers, found in step (i), by merge insertion.

iii) Name the elements a_1, a_2, ..., a_{⌊n/2⌋}, b_1, b_2, ..., b_{⌈n/2⌉} as in (7), where a_1 ≤
a_2 ≤ ⋯ ≤ a_{⌊n/2⌋} and b_i ≤ a_i for 1 ≤ i ≤ ⌊n/2⌋; call b_1 and the a's the
"main chain." Insert the remaining b's into the main chain, using binary
insertion, in the following order, leaving out all b_j for j > ⌈n/2⌉:

    b_3, b_2;  b_5, b_4;  b_11, b_10, ..., b_6;  ...;  b_{t_k}, b_{t_k−1}, ..., b_{t_{k−1}+1};  ...    (11)

We wish to define the sequence (t_1, t_2, t_3, t_4, ...) = (1, 3, 5, 11, ...), which
appears in (11), in such a way that each of b_{t_k}, b_{t_k−1}, ..., b_{t_{k−1}+1} can be inserted
into the main chain with at most k comparisons. Generalizing (7), (8), and (9),
we obtain the diagram

    [diagram: the main chain x_1 → x_2 → ⋯ → x_{2t_{k−1}} → a_{t_{k−1}+1} → ⋯ → a_{t_k−1},
    with b_i → a_i for t_{k−1} < i < t_k and with b_{t_k} at the end]

where the main chain up to and including a_{t_k−1} contains 2t_{k−1} + (t_k − t_{k−1} − 1)
elements. This number must be less than 2^k; our best bet is to set it equal to
2^k − 1, so that

    t_{k−1} + t_k = 2^k.    (12)

Since t_1 = 1, we may set t_0 = 1 for convenience, and we find that

    t_k = 2^k − t_{k−1} = 2^k − 2^{k−1} + t_{k−2} = ⋯ = 2^k − 2^{k−1} + ⋯ + (−1)^k 2^0
        = (2^{k+1} + (−1)^k)/3    (13)

by summing a geometric series. (Curiously, this same sequence arose in our
study of an algorithm for calculating the greatest common divisor of two integers;
see exercise 4.5.2-36.)

Let F(n) be the number of comparisons required to sort n elements by merge
insertion. Clearly

    F(n) = ⌊n/2⌋ + F(⌊n/2⌋) + G(⌈n/2⌉),    (14)

where G represents the amount of work involved in step (iii). If t_{k−1} ≤ m ≤ t_k,
we have

    G(m) = Σ_{j=1}^{k−1} j(t_j − t_{j−1}) + k(m − t_{k−1}) = km − (t_0 + t_1 + ⋯ + t_{k−1}),    (15)




summing by parts. Let us set

    w_k = t_0 + t_1 + ⋯ + t_{k−1} = ⌊2^{k+1}/3⌋,    (16)

so that (w_0, w_1, w_2, w_3, w_4, w_5, ...) = (0, 1, 2, 5, 10, 21, ...). Exercise 13 shows that

    F(n) − F(n−1) = k   if and only if   w_k < n ≤ w_{k+1},    (17)

and the latter condition is equivalent to

    2^{k+1}/3 < n ≤ 2^{k+2}/3,

or k + 1 < lg 3n ≤ k + 2; hence

    F(n) − F(n−1) = ⌈lg(3n/4)⌉.    (18)

(This formula is due to A. Hadian [Ph.D. thesis, Univ. of Minnesota (1969),
38-42].) It follows that F(n) has a remarkably simple expression,

    F(n) = Σ_{k=1}^{n} ⌈lg(3k/4)⌉,    (19)

quite similar to the corresponding formula (3) for binary insertion. A closed
form for this sum appears in exercise 14.

Equation (19) makes it easy to construct a table of F(n); we have

          n = 1  2  3  4  5   6   7   8   9  10  11  12  13  14  15  16  17
    ⌈lg n!⌉ = 0  1  3  5  7  10  13  16  19  22  26  29  33  37  41  45  49
       F(n) = 0  1  3  5  7  10  13  16  19  22  26  30  34  38  42  46  50

          n = 18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33
    ⌈lg n!⌉ = 53  57  62  66  70  75  80  84  89  94  98 103 108 113 118 123
       F(n) = 54  58  62  66  71  76  81  86  91  96 101 106 111 116 121 126

Notice that F(n) = ⌈lg n!⌉ for 1 ≤ n ≤ 11 and for 20 ≤ n ≤ 21, so we know that
merge insertion is optimum for those n:

    S(n) = ⌈lg n!⌉ = F(n)   for n = 1, ..., 11, 20, and 21.    (20)
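Both rows of the table, and the list of cases in (20), can be rechecked mechanically from (18)-(19); the following Python sketch (illustrative only) computes ⌈lg(3k/4)⌉ exactly with integer arithmetic:

    from math import factorial

    def ceil_lg_3k4(k):                  # ceiling of lg(3k/4), exactly
        return max((3 * k - 1).bit_length() - 2, 0)

    F = [0]
    for k in range(1, 34):
        F.append(F[-1] + ceil_lg_3k4(k))              # Eqs. (18), (19)
    lgf = [0] + [(factorial(n) - 1).bit_length() for n in range(1, 34)]
    assert F[12] == 30 and F[22] == 71 and lgf[21] == 66
    print([n for n in range(1, 34) if F[n] == lgf[n]])
    # prints [1, 2, ..., 11, 20, 21] -- the cases (20) where merge
    # insertion attains the information-theoretic lower bound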

Hugo Steinhaus posed the problem of finding S(n) in the second edition of his
classic book Mathematical Snapshots (Oxford University Press, 1950), 38-39. He
described the method of binary insertion, which is the best possible way to sort n
objects if we start by sorting n − 1 of them first before the nth is considered; and
he conjectured that binary insertion would be optimum in general. Several years
later [Calcutta Math. Soc. Golden Jubilee Commemoration 2 (1959), 323-327],
he reported that two of his colleagues, S. Trybula and P. Czen, had "recently"
disproved his conjecture, and that they had determined S(n) for n ≤ 11. Trybula
and Czen may have independently discovered the method of merge insertion,
which was published soon afterwards by Ford and Johnson [AMM 66 (1959),
387-389].

After the discovery of merge insertion, the first unknown value of S(n) was
S(12). Table 1 shows that 12! is quite close to 2^29, hence the existence of a




Table 1
VALUES OF FACTORIALS IN BINARY NOTATION

                                                                  (1)_2 = 1!
                                                                 (10)_2 = 2!
                                                                (110)_2 = 3!
                                                              (11000)_2 = 4!
                                                            (1111000)_2 = 5!
                                                         (1011010000)_2 = 6!
                                                      (1001110110000)_2 = 7!
                                                   (1001110110000000)_2 = 8!
                                                (1011000100110000000)_2 = 9!
                                             (1101110101111100000000)_2 = 10!
                                         (10011000010001010100000000)_2 = 11!
                                      (11100100011001111110000000000)_2 = 12!
                                  (101110011001010001100110000000000)_2 = 13!
                              (1010001001100001110110010100000000000)_2 = 14!
                          (10011000001110111011101110101100000000000)_2 = 15!
                       (100110000011101110111011101011000000000000000)_2 = 16!
                   (1010000110111111011101110110011011000000000000000)_2 = 17!
                (10110101111101110110011001010011100110000000000000000)_2 = 18!
            (110110000001010111001001100000110100010010000000000000000)_2 = 19!
        (10000111000011011001110111110010000010101101000000000000000000)_2 = 20!


29-step sorting procedure for 12 elements is somewhat unlikely. An exhaustive
search (about 60 hours on a Maniac II computer) was therefore carried out by
Mark Wells, who discovered that S(12) = 30 [Proc. IFIP Congress 65 2 (1965),
497-498; Elements of Combinatorial Computing (Pergamon, 1971), 213-215].
Thus the merge insertion procedure turns out to be optimum for n = 12 as well.

*A slightly deeper analysis. In order to study S(n) more carefully, let us look
more closely at partial ordering diagrams such as (5). After several comparisons
have been made, we can represent the knowledge we have gained in terms of a
directed graph. This directed graph contains no cycles, in view of the transitivity
of the < relation, so we can draw it in such a way that all arcs go from left to
right; it is therefore convenient to leave arrows off the diagram. In this way (5)
becomes

    a — b — d,   c — d,   e    (21)


If G is such a directed graph, let T(G) be the number of permutations consistent
with G, that is, the number of ways to assign the integers {1, 2, ..., n} to the
vertices of G so that the number on vertex x is less than the number on vertex
y whenever x → y in G. For example, one of the permutations consistent with
(21) has a = 1, b = 4, c = 2, d = 5, e = 3. We have studied T(G) for various G
in Section 5.1.4, where we observed that T(G) is the number of ways in which
G can be sorted topologically.




If G is a graph on n elements that can be obtained after k comparisons, we
define the efficiency of G to be

    E(G) = n!/(2^k T(G)).    (22)

(This idea is due to Frank Hwang and Shen Lin.) Strictly speaking, the efficiency
is not a function of the graph G alone, it depends on the way we arrived at G
during a sorting process, but it is convenient to be a little careless in our language.
After making one more comparison, between elements i and j, we obtain two
graphs G_1 and G_2, one for the case K_i < K_j and one for the case K_i > K_j.
Clearly

    T(G) = T(G_1) + T(G_2).

If T(G_1) ≥ T(G_2), we have T(G) ≤ 2T(G_1), hence

    E(G_1) = n!/(2^{k+1} T(G_1)) = E(G) T(G)/(2 T(G_1)) ≤ E(G).    (23)

Therefore each comparison leads to at least one graph of less or equal efficiency;
we can't improve the efficiency by making further comparisons.

When G has no arcs at all, we have k = 0 and T(G) = n!, so the initial
efficiency is 1. At the other extreme, when G is a graph representing the final
result of sorting, G looks like a straight line and T(G) = 1. Thus, for example,
if we want to find a sorting procedure that sorts five elements in at most seven
steps, we must obtain the linear graph • — • — • — • — •, whose efficiency is 5!/(2^7 × 1) =
120/128 = 15/16. It follows that all of the graphs arising in the sorting procedure
must have efficiency ≥ 15/16; if any less efficient graph were to appear, at least one
of its descendants would also be less efficient, and we would ultimately reach
a linear graph whose efficiency is < 15/16. In general, this argument proves that
all graphs corresponding to the tree nodes of a sorting procedure for n elements
must have efficiency ≥ n!/2^l, where l is the number of levels of the tree (not
counting external nodes). This is another way to prove that S(n) ≥ ⌈lg n!⌉,
although the argument is not really much different from what we said before.

The graph (21) has efficiency 1, since T(G) = 15 and since G has been
obtained in three comparisons. In order to see what vertices should be compared
next, we can form the comparison matrix

                  a   b   c   d   e
             a (  0  15  10  15  11 )
             b (  0   0   5  15   7 )
    C(G) =   c (  5  10   0  15   9 )    (24)
             d (  0   0   0   0   3 )
             e (  4   8   6  12   0 )

where C_ij is T(G_1) for the graph G_1 obtained by adding the
arc i → j to G. For example, if we compare K_c with K_e,
the 15 permutations consistent with G




split up into C_ec = 6 having K_e < K_c and C_ce = 9 having K_c < K_e. The
latter graph would have efficiency 15/(2 × 9) = 5/6 < 15/16, so it could not lead to a
seven-step sorting procedure. The next comparison must be K_b : K_e in order to
keep the efficiency ≥ 15/16.
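Everything in (24) can be verified by brute force, since T(G) merely counts the consistent assignments; a short illustrative Python check:

    from itertools import permutations

    verts = 'abcde'
    arcs = [('a', 'b'), ('b', 'd'), ('c', 'd')]      # the graph (21)

    # all assignments of {0,...,4} to the vertices that respect the arcs:
    exts = [p for p in permutations(range(5))
            if all(p[verts.index(x)] < p[verts.index(y)] for x, y in arcs)]
    assert len(exts) == 15        # T(G) = 15, so E(G) = 5!/(2**3 * 15) = 1

    def C(i, j):                  # entry C_ij of the comparison matrix (24)
        return sum(p[verts.index(i)] < p[verts.index(j)] for p in exts)

    assert C('c', 'e') == 9 and C('e', 'c') == 6     # as computed in the text
    assert C('b', 'e') == 7 and C('e', 'b') == 8     # the 7/8 split: the only
                                                     # comparison keeping E >= 15/16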

The concept of efficiency is especially useful when we consider the connected
components of graphs. Consider for example the graph

    [diagram: a graph consisting of two components, G′ (containing
    a, b, d, e, ...) and G″ (containing f and g)]

with no arcs connecting G′ to G″; it has been formed by making some
comparisons entirely within G′ and others entirely within G″. In general, assume
that G = G′ ⊕ G″ has no arcs between G′ and G″, where G′ and G″ have
respectively n′ and n″ vertices; it is easy to see that

    T(G) = \binom{n′+n″}{n′} T(G′) T(G″),    (25)


since each consistent permutation of G is obtained by choosing n′ elements
to assign to G′ and then making consistent permutations within G′ and G″
independently. If k′ comparisons have been made within G′ and k″ within G″,
we have the basic result

    E(G) = \frac{(n′+n″)!}{2^{k′+k″} T(G)} = \frac{n′!}{2^{k′} T(G′)} · \frac{n″!}{2^{k″} T(G″)} = E(G′) E(G″),    (26)

showing that the efficiency of a graph is related in a simple way to the efficiency
of its components. Therefore we may restrict consideration to graphs having
only one component.

Now suppose that G′ and G″ are one-component graphs, and suppose that
we want to hook them together by comparing a vertex x of G′ with a vertex y
of G″. We want to know how efficient this will be. For this purpose we need a
function that can be denoted by

    \binom{p\,<\,q}{m\ \ \ n},    (27)

defined to be the number of permutations consistent with the graph

    [diagram (28): two chains a_1 → a_2 → ⋯ → a_m and b_1 → b_2 → ⋯ → b_n,
    together with the single additional arc a_p → b_q]




Thus \binom{p\,<\,q}{m\ \ \ n} is \binom{m+n}{m} times the probability that the pth smallest of a set of
m numbers is less than the qth smallest of an independently chosen set of n
numbers. Exercise 17 shows that we can express this quantity in two ways in terms
of binomial coefficients,

    \binom{p\,<\,q}{m\ \ \ n} = \sum_{0\le k<q} \binom{m-p+n-k}{m-p} \binom{p-1+k}{p-1}
                  = \sum_{p\le j\le m} \binom{n-q+m-j}{n-q} \binom{q-1+j}{q-1}.    (29)


(Incidentally, it is by no means obvious on algebraic grounds that these two sums 
of products of binomial coefficients should come out to be equal.) We also have 
the formulas 


    \binom{p\,<\,q}{m\ \ \ n} + \binom{q\,<\,p}{n\ \ \ m} = \binom{m+n}{m};    (30)

    \binom{q\,<\,p}{n\ \ \ m} = \binom{m+1-p\,<\,n+1-q}{m\ \ \ \ \ \ \ \ \ \ n};    (31)

    \binom{p\,<\,q}{m\ \ \ n} = \binom{p\,<\,q}{m-1\ \ \ n} + \binom{p\,<\,q}{m\ \ \ n-1} + [p\le m][q=n]\binom{m+n-1}{m}.    (32)
V rn nj \m- 1 n J \ rn n — lj \ rn / 

For definiteness, let us now consider the two graphs 


(32) 



(33) 


It is not hard to show by direct enumeration that T(G′) = 42 and T(G″) = 5; so
if G is the 11-vertex graph having G′ and G″ as components, we have T(G) =
\binom{11}{4} · 42 · 5 = 69300 by Eq. (25). This is a formidable number of permutations
to list, if we want to know how many of them have x_i < y_j for each i and j.
But the calculation can be done by hand, in less than an hour, as follows. We
form the matrices A(G′) and A(G″), where A_ik is the number of consistent
permutations of G′ (or G″) in which x_i (or y_i) is equal to k. Thus the number of
permutations of G in which x_i is less than y_j is the (i,p) element of A(G′) times
\binom{p\,<\,q}{7\ \ \ 4} times the (j,q) element of A(G″), summed over 1 ≤ p ≤ 7 and 1 ≤ q ≤ 4.
In other words, we want to form the matrix product A(G′) · L · A(G″)^T, where
L_pq = \binom{p\,<\,q}{7\ \ \ 4}. This comes to


    ( 21 16  5  0  0  0  0 ) ( 210 294 322 329 )
    (  0  5 10 12 10  5  0 ) ( 126 238 301 325 ) ( 2 3 0 0 )
    ( 21 16  5  0  0  0  0 ) (  70 175 265 315 ) ( 2 2 0 1 )
    (  0  0 12 18 12  0  0 ) (  35 115 215 295 ) ( 1 0 2 2 )
    (  0  0  0  0  5 16 21 ) (  15  65 155 260 ) ( 0 0 3 2 )
    (  0  5 10 12 10  5  0 ) (   5  29  92 204 )
    (  0  0  0  0  5 16 21 ) (   1   8  36 120 )

            ( 48169 42042 66858 64031 )
            ( 22825 16005 53295 46475 )
            ( 48169 42042 66858 64031 )
        =   ( 22110 14850 54450 47190 )
            (  5269  2442 27258 21131 )
            ( 22825 16005 53295 46475 )
            (  5269  2442 27258 21131 )




[Diagram: small graphs G_1, G_2, G_3, ..., each drawn with its efficiency.]

Fig. 36. Some graphs and their efficiencies, obtained at the beginning of a long proof
that S(12) > 29.


Thus the "best" way to hook up G′ and G″ is to compare x_1 with y_2; this gives
42042 cases with x_1 < y_2 and 69300 − 42042 = 27258 cases with x_1 > y_2. (By
symmetry, we could also compare x_3 with y_2, x_5 with y_3, or x_7 with y_3, leading to
essentially the same results.) The efficiency of the resulting graph for x_1 < y_2 is

    (69300/84084) E(G′) E(G″) ≈ 0.824 E(G′) E(G″),

which is none too good; hence it is probably a bad idea to hook G′ up with G″
in any sorting method. The point of this example is that we are able to make
such a decision without excessive calculation.

These ideas can be used to provide independent confirmation of Mark Wells's
proof that S(12) = 30. Starting with a graph containing one vertex, we can
repeatedly try to add a comparison to one of our graphs G or to G′ ⊕ G″ (a pair
of graph components G′ and G″) in such a way that the two resulting graphs
have 12 or fewer vertices and efficiency ≥ 12!/2^29 ≈ 0.89221. Whenever this is
possible, we take the resulting graph of least efficiency and add it to our set,
unless one of the two graphs is isomorphic to a graph we already have included.
If both of the resulting graphs have the same efficiency, we arbitrarily choose
one of them. A graph can be identified with its dual (obtained by reversing the
order), so long as we consider adding comparisons to G′ ⊕ dual(G″) as well as
to G′ ⊕ G″. A few of the smallest graphs obtained in this way are displayed in
Fig. 36 together with their efficiencies.

Exactly 1649 graphs were generated, by computer, before this process terminated.
Since the linear graph • — • — ⋯ — • on 12 vertices was not obtained, we may
conclude that S(12) > 29. It is plausible that a similar experiment could be
performed to deduce that S(22) > 70 in a fairly reasonable amount of time, since
22!/2^70 ≈ 0.952 requires extremely high efficiency to sort in 70 steps. (Only 91
of the 1649 graphs found on 12 or fewer vertices had such high efficiency.)




Martin Peczarski [see Algorithmica 40 (2004), 133-145; Information Proc.
Letters 101 (2007), 126-128] extended Wells's method and proved that S(13) =
34, S(14) = 38, S(15) = 42, S(22) = 71; thus merge insertion is optimum
in those cases as well. Intuitively, it seems likely that S(16) will some day be
shown to be less than F(16), since F(16) involves no fewer steps than sorting
ten elements with S(10) comparisons and then inserting six others by binary
insertion, one at a time. There must be a way to improve upon this! But at
present, the smallest case where F(n) is definitely known to be nonoptimum is
n = 47: After sorting 5 and 42 elements with F(5) + F(42) = 178 comparisons,
we can merge the results with 22 further comparisons, using a method due to
J. Schulte Mönting, Theoretical Comp. Sci. 14 (1981), 19-37; this strategy beats
F(47) = 201. (Glenn K. Manacher [JACM 26 (1979), 441-456] had previously
proved that infinitely many n exist with S(n) < F(n), starting with n = 189.)


The average number of comparisons. So far we have been considering
procedures that are best possible in the sense that their worst case isn't bad;
in other words, we have looked for "minimax" procedures that minimize the
maximum number of comparisons. Now let us look for a "minimean" procedure
that minimizes the average number of comparisons, assuming that the input is
random so that each permutation is equally likely.

Consider once again the tree representation of a sorting procedure, as shown
in Fig. 34. The average number of comparisons in that tree is

    (2 + 3 + 3 + 3 + 3 + 2)/6 = 2 2/3,



averaging over all permutations. In general, the average number of comparisons
in a sorting method is the external path length of the tree divided by n!. (Recall
that the external path length is the sum of the distances from the root to each of
the external nodes; see Section 2.3.4.5.) It is easy to see from the considerations
of Section 2.3.4.5 that the minimum external path length occurs in a binary tree
with N external nodes if there are 2^q − N external nodes at level q − 1 and
2N − 2^q at level q, where q = ⌈lg N⌉. (The root is at level zero.) The minimum
external path length is therefore

    (q − 1)(2^q − N) + q(2N − 2^q) = (q + 1)N − 2^q.    (34)

The minimum path length can also be characterized in another interesting way:
An extended binary tree has minimum external path length for a given number
of external nodes if and only if there is a number l such that all external nodes
appear on levels l and l + 1. (See exercise 20.)

If we set q = lg N + θ, where 0 ≤ θ < 1, the formula for minimum external
path length becomes

    N(lg N + 1 + θ − 2^θ).    (35)

The function 1 + θ − 2^θ is shown in Fig. 37; for 0 < θ < 1 it is positive but very
small, never exceeding

    1 − (1 + ln ln 2)/ln 2 = 0.08607 13320 55934+.    (36)




Fig. 37. The function 1 + θ − 2^θ.


Thus the minimum possible average number of comparisons, obtained by dividing
(35) by N, is never less than lg N and never more than lg N + 0.0861. [This result
was first obtained by A. Gleason in an internal IBM memorandum (1956).]

Now if we set N = n!, we get a lower bound for the average number of
comparisons in any sorting scheme. Asymptotically speaking, this lower bound is

    lg n! + O(1) = n lg n − n/ln 2 + O(log n).    (37)

Let F̄(n) be the average number of comparisons performed by the merge
insertion algorithm; we have

                   n = 1  2   3    4    5     6      7       8
     lower bound (34) = 0  2  16  112  832  6896  62368  619904
             n! F̄(n) = 0  2  16  112  832  6912  62784  623232

Thus merge insertion is optimum in both senses for n ≤ 5, but for n = 6
it averages 6912/720 = 9.6 comparisons while our lower bound says that an
average of 6896/720 = 9.577777... comparisons might be possible. A moment's
reflection shows why this is true: Some "fortunate" permutations of six elements
are sorted by merge insertion after only eight comparisons, so the comparison
tree has external nodes appearing on three levels instead of two. This forces
the overall path length to be higher. Exercise 24 shows that it is possible to
construct a six-element sorting procedure that requires nine or ten comparisons
in each case; it follows that this method is superior to merge insertion, on the
average, and no worse than merge insertion in its worst case.
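The lower-bound row of the table above comes straight out of (34) with N = n!, as a short illustrative computation confirms:

    from math import factorial

    def min_epl(N):                      # Eq. (34): minimum external path length
        q = (N - 1).bit_length()         # q = ceiling of lg N
        return (q + 1) * N - 2**q

    assert [min_epl(factorial(n)) for n in range(1, 9)] == \
           [0, 2, 16, 112, 832, 6896, 62368, 619904]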

When n = 7, Y. Cesari [Thesis (Univ. of Paris, 1968), page 37] has shown 
that no sorting method can attain the lower bound 62368 on external path 
length. (It is possible to prove this fact without a computer, using the results of 
exercise 22.) On the other hand, he has constructed procedures that do achieve 
the lower bound (34) when n = 9 or 10. In general, the problem of minimizing 
the average number of comparisons turns out to be substantially more difficult 
than the problem of determining S(n). It may even be true that, for some n, all 
methods that minimize the average number of comparisons require more than 
S(n) comparisons in their worst case. 

EXERCISES 

1. [20] Draw the comparison trees for sorting four elements using the method of 
(a) binary insertion; (b) straight two-way merging. What are the external path lengths 
of these trees? 

2. [M24] Prove that B(n) ≤ L(n), and find all n for which equality holds.




3. [M22] (Weak orderings.) When equality between keys is allowed, there are 13
possible outcomes when sorting three elements:

    K_1 = K_2 = K_3,   K_1 = K_2 < K_3,   K_1 = K_3 < K_2,   K_2 = K_3 < K_1,
    K_1 < K_2 = K_3,   K_2 < K_1 = K_3,   K_3 < K_1 = K_2,   K_1 < K_2 < K_3,
    K_1 < K_3 < K_2,   K_2 < K_1 < K_3,   K_2 < K_3 < K_1,   K_3 < K_1 < K_2,
    K_3 < K_2 < K_1.

Let P_n denote the number of possible outcomes when n elements are sorted with ties
allowed, so that (P_0, P_1, P_2, P_3, P_4, P_5, ...) = (1, 1, 3, 13, 75, 541, ...). Prove that the
generating function P(z) = Σ_{n≥0} P_n z^n/n! is equal to 1/(2 − e^z). Hint: Show that

    P_n = Σ_{k≥1} \binom{n}{k} P_{n−k}   when n > 0.


4. [HM27] (O. A. Gross.) Determine the asymptotic value of the numbers P_n of
exercise 3, as n → ∞. [Possible hint: Consider the partial fraction expansion of cot z.]

5. [16] When keys can be equal, each comparison may have three results instead
of two: K_i < K_j, K_i = K_j, K_i > K_j. Sorting algorithms for this general situation
can be represented as extended ternary trees, in which each internal node i:j has
three subtrees; the left, middle, and right subtrees correspond respectively to the three
possible outcomes of the comparison.

Draw an extended ternary tree that defines a sorting algorithm for n = 3, when
equal keys are allowed. There should be 13 external nodes, corresponding to the 13
possible outcomes listed in exercise 3.

► 6. [M22] Let S′(n) be the minimum number of comparisons necessary to sort n
elements and to determine all equalities between keys, when each comparison has three
outcomes as in exercise 5. The information-theoretic argument of the text can readily
be generalized to show that S′(n) ≥ ⌈log_3 P_n⌉, where P_n is the function studied in
exercises 3 and 4; but prove that, in fact, S′(n) = S(n).

7. [20] Draw an extended ternary tree in the sense of exercise 5 for sorting four
elements, when it is known that all keys are either 0 or 1. (Thus if K_1 < K_2 and
K_3 < K_4, we know that K_1 = K_3 and K_2 = K_4!) Use the minimum average number
of comparisons, assuming that the 2^4 possible inputs are equally likely. Be sure to
determine all equalities that are present; for example, don't stop sorting when you
know only that K_1 ≤ K_2 ≤ K_3 ≤ K_4.

8. [26] Draw an extended ternary tree as in exercise 7 for sorting four elements,
when it is known that all keys are either −1, 0, or +1. Use the minimum average
number of comparisons, assuming that the 3^4 possible inputs are equally likely.

9. [M20] When sorting n elements as in exercise 7, knowing that all keys are 0 or 1,
what is the minimum number of comparisons in the worst case?

► 10. [M25] When sorting n elements as in exercise 7, knowing that all keys are 0 or 1,
what is the minimum average number of comparisons as a function of n?

11. [HM27] When sorting n elements as in exercise 5, and knowing that all keys are
members of the set {1, 2, ..., m}, let S_m(n) be the minimum number of comparisons
needed in the worst case. [Thus by exercise 6, S_n(n) = S(n).] Prove that, for fixed m,
S_m(n) is asymptotically n lg m + O(1) as n → ∞.






► 12. [M25] (W. G. Bouricius, circa 1954.) Suppose that equal keys may occur, but we
merely want to sort the elements {K_1, K_2, ..., K_n} so that a permutation a_1 a_2 ... a_n
is determined with K_{a_1} ≤ K_{a_2} ≤ ⋯ ≤ K_{a_n}; we do not need to know whether or not
equality occurs between K_{a_i} and K_{a_{i+1}}.

Let us say that a comparison tree sorts a sequence of keys strongly if it will sort
the sequence in the stated sense no matter which branch is taken below the nodes i:j
for which K_i = K_j. (The tree is binary, not ternary.)

a) Prove that a comparison tree with no redundant comparisons sorts every sequence
of keys strongly if and only if it sorts every sequence of distinct keys.

b) Prove that a comparison tree sorts every sequence of keys strongly if and only if
it sorts every sequence of zeros and ones strongly.

13. [M28] Prove (17). 

14. [M24] Find a closed form for the sum (19). 

15. [M21] Determine the asymptotic behavior of B(n) and F(n) up to O(log n).
[Hint: Show that in both cases the coefficient of n involves the function shown in
Fig. 37.]

16. [HM26] (F. Hwang and S. Lin.) Prove that F(n) > ⌈lg n!⌉ for n ≥ 22.

17. [M20] Prove (29). 

18. [20] If the procedure whose first steps are shown in Fig. 36 had produced the
linear graph • — • — ⋯ — • with efficiency 12!/2^29, would this have proved
that S(12) = 29?

19. [40] Experiment with the following heuristic rule for deciding which pair of
elements to compare next while designing a comparison tree: At each stage of sorting
{K_1, ..., K_n}, let u_i be the number of keys known to be < K_i as a result of the
comparisons made so far, and let v_i be the number of keys known to be > K_i, for 1 ≤ i ≤ n.
Renumber the keys in terms of increasing u_i/v_i, so that u_1/v_1 ≤ u_2/v_2 ≤ ⋯ ≤ u_n/v_n.
Now compare K_i : K_{i+1} for some i that minimizes |u_i v_{i+1} − u_{i+1} v_i|. (Although this
method is based on far less information than a full comparison matrix as in (24), it
appears to give optimum results in many cases.)

► 20. [M26] Prove that an extended binary tree has minimum external path length if
and only if there is a number l such that all external nodes appear on levels l and l + 1
(or perhaps all on a single level l).

21. [M21] The height of an extended binary tree is the maximum level number of its
external nodes. If x is an internal node of an extended binary tree, let t(x) be the
number of external nodes below x, and let l(x) denote the root of x's left subtree. If
x is an external node, let t(x) = 1. Prove that an extended binary tree has minimum
height among all binary trees with the same number of nodes if

    |t(x) − 2t(l(x))| ≤ 2^{⌈lg t(x)⌉} − t(x)

for all internal nodes x.

22. [M24] Continuing exercise 21, prove that a binary tree has minimum external
path length among all binary trees with the same number of nodes if and only if

    |t(x) − 2t(l(x))| ≤ 2^{⌈lg t(x)⌉} − t(x)   and   |t(x) − 2t(l(x))| ≤ t(x) − 2^{⌊lg t(x)⌋}

for all internal nodes x. [Thus, for example, if t(x) = 67, we must have t(l(x)) = 32,
33, 34, or 35. If we merely wanted to minimize the height of the tree we could have
3 ≤ t(l(x)) ≤ 64, by the preceding exercise.]




23. [10] The text proves that the average number of comparisons made by any sorting
method for n elements must be at least ⌈lg n!⌉ ≈ n lg n. But multiple list insertion
(Program 5.2.1M) takes only O(n) units of time on the average. How can this be?

24. [27] (C. Picard.) Find a sorting tree for six elements such that all external nodes 
appear on levels 10 and 11 . 

25. [11] If there were a sorting procedure for seven elements that achieves the
minimum average number of comparisons predicted by the use of Eq. (34), how many
external nodes would there be on level 13?

26. [M42] Find a sorting procedure for seven elements that minimizes the average
number of comparisons performed.

► 27. [20] Suppose it is known that the configurations K_1 < K_2 < K_3, K_1 < K_3 < K_2,
K_2 < K_1 < K_3, K_2 < K_3 < K_1, K_3 < K_1 < K_2, K_3 < K_2 < K_1 occur with respective
probabilities .01, .25, .01, .24, .25, .24. Find a comparison tree that sorts these three
elements with the smallest average number of comparisons.

28. [40] Write a MIX program that sorts five one-word keys in the minimum possible 
amount of time, and halts. (See the beginning of Section 5.2 for ground rules.) 

29. [M25] (S. M. Chase.) Let a_1 a_2 ... a_n be a permutation of {1, 2, ..., n}. Prove that
any algorithm that decides whether this permutation is even or odd (that is, whether
it has an even or odd number of inversions), based solely on comparisons between the
a's, must make at least n lg n comparisons, even though the algorithm has only two
possible outcomes.

30. [M23] (Optimum exchange sorting.) Every exchange sorting algorithm as defined 
in Section 5.2.2 can be represented as a comparison-exchange tree, namely a binary tree 
structure whose internal nodes have the form i:j for i < j, interpreted as the following 
operation: “If K t < Kj, continue by taking the left branch of the tree; if K, > Kj , 
continue by interchanging records i and j and then taking the right branch of the tree.” 
When an external node is encountered, it must be true that K\ < K 2 < ■ ■ ■ < K n . 
Thus, a comparison-exchange tree differs from a comparison tree in that it specifies 
data movement as well as comparison operations. 

Let S e (n) denote the minimum number of comparison-exchanges needed, in the 
worst case, to sort n elements by means of a comparison-exchange tree. Prove that 
S e (n) < S(n) + n — 1 . 

31. [M38] Continuing exercise 30, prove that S e (5) = 8. 

32. [M 42 ] Continuing exercise 31, investigate S e (n) for small values of n > 5 . 

33. [M30] (T. N. Hibbard.) A real-valued search tree of order x and resolution d is 
an extended binary tree in which all nodes contain a nonnegative real value such that 
(i) the value in each external node is < 5 , (ii) the value in each internal node is at 
most the sum of the values in its two children, and (iii) the value in the root is x. The 
weighted path length of such a tree is defined to be the sum, over all external nodes, of 
the level of that node times the value it contains. 

Prove that a real-valued search tree of order x and resolution 1 has minimum 
weighted path length, taken over all such trees of the same order and resolution, if and 
only if equality holds in (ii) and the following further conditions hold for all pairs of 
values Xo and x\ that are contained in sibling nodes: (iv) There is no integer k T 0 such 
that *0 < 2 < ii or ii < 2 < xq. (v) [xo] — xq + [xj] — xi < 1 . (In particular if x is 
an integer, condition (v) implies that all values in the tree are integers, and condition 
(iv) is equivalent to the result of exercise 22 .) 


5.3.2 


MINIMUM-COMPARISON MERGING 197 


Also prove that the corresponding minimum weighted path length is x [ lg x] + 

|V| -2 riga: l 

34. [M50] Determine the exact value of S(n) for infinitely many n. 

35. [49] Determine the exact value of S(16). 

36. [M50] (S. S. Kislitsyn, 1968.) Prove or disprove: Any directed acyclic graph G 
with T(G) > 1 has two vertices u and v such that the digraphs G\ and G 2 ob- 
tained from G by adding the arcs u <— v and u — > v are acyclic and satisfy 1 < 
T(Gi)/T(G 2 ) < 2. (Thus T(Gi)/T(G) always lies between | and for some u and v.) 


*5.3.2. Minimum-Comparison Merging 

Let us now consider a related question: What is the best way to merge an 
ordered set of m elements with an ordered set of n? Denoting the elements to 
be merged by 


Ai < A 2 < ■■ ■ < A m and B x < B 2 < ■ • • < B n , ( 1 ) 


we shall assume as in Section 5.3.1 that the m + n elements are distinct. The 
A's may appear among the B's in ( m rj ^") ways, so the arguments we have used 
for the sorting problem tell us immediately that at least 



m + n 
m 


) 


( 2 ) 


comparisons are required. If we set m = an and let n — > 00 , while a is fixed, 
Stirling’s approximation tells us that 

lg ( a? cm ") = n (( 1 + a ) 1 S ( 1 + a ) ~ al S a ) - + 0 (!)- (3) 

The normal merging procedure, Algorithm 5.2.4M, takes m + n — 1 comparisons 
in its worst case. 

Let M{m,n) denote the function analogous to S(n), namely the minimum 
number of comparisons that will always suffice to merge m things with n. By 
the observations we have just made, 

< M(m, n) < m + n — 1 for all m, n > 1. ( 4 ) 

Formula ( 3 ) shows how far apart this lower bound and upper bound can be. 
When a = 1 (that is, to = n), the lower bound is 2 n — | lgn + 0(1), so both 
bounds have the right order of magnitude but the difference between them can 
be arbitrarily large. When a = 0.5 (that is, m = |n), the lower bound is 

|n(lg3 - |) + O(logn), 

which is about lg 3 — | ss 0.918 times the upper bound. And as a decreases, the 
bounds get farther and farther apart, since the standard merging algorithm is 
primarily designed for files with m « n. 



198 SORTING 


5.3.2 


When m = n, the merging problem has a fairly simple solution; it turns 
out that the lower bound of (4), not the upper bound, is at fault. The follow- 
ing theorem was discovered independently by R. L. Graham and R. M. Karp 
about 1968 : 

Theorem M. For all m > 1 , we have M(m,m) = 2 m — 1 . 

Proof. Consider any algorithm that merges A x < ■ ■ • < A m with B 1 < • • • < B m . 
When it compares Ai'.Bj, take the branch A t < Bj if * < j, the branch A* > Bj 
if i > j. Merging must eventually terminate with the configuration 

B\ < Ai < B 2 < A 2 < ■ ■ ■ < B m < A m , (5) 

since this is consistent with all the branches taken. And each of the 2 to - 1 
comparisons 

Bi'.Ai, Ai:B 2 , B 2 \A 2 , ..., B m :A m 

must have been made explicitly, or else there would be at least two configurations 
consistent with the known facts. For example, if Ai has not been compared to 
B 2 , the configuration 

B x < B 2 < Ai < A 2 < ■ ■ ■ < B m < A m 

is indistinguishable from (5). | 

A simple modification of this proof yields the companion formula 

M(to,to+1) = 2to, for to > 0. (6) 

Constructing lower bounds. Theorem M shows that the “information the- 
oretic” lower bound (2) can be arbitrarily far from the true value; thus the 
technique used to prove Theorem M gives us another way to discover lower 
bounds. Such a proof technique is often viewed as the creation of an adversary , 
a pernicious being who tries to make algorithms run slowly. When an algorithm 
for merging decides to compare Ai : Bj , the adversary determines the fate of the 
comparison so as to force the algorithm down the more difficult path. If we can 
invent a suitable adversary, as in the proof of Theorem M, we can ensure that 
every valid merging algorithm will have to make quite a few comparisons. 

We shall make use of constrained adversaries, whose power is limited with 
regard to the outcomes of certain comparisons. A merging method that is under 
the influence of a constrained adversary does not know about the constraints, 
so it must make the necessary comparisons even though their outcomes have 
been predetermined. For example, in our proof of Theorem M we constrained all 
outcomes by condition (5), yet the merging algorithm was unable to make use 
of that fact in order to avoid any of the comparisons. 

The constraints we shall use in the following discussion apply to the left and 
right ends of the files. Left constraints are symbolized by 
. (meaning no left constraint), 

\ (meaning that all outcomes must be consistent with A\ < Bi), 

/ (meaning that all outcomes must be consistent with A 1 > B 1); 


5 . 3.2 


MINIMUM-COMPARISON MERGING 199 


similarly, right constraints are symbolized by 
. (meaning no right constraint), 

\ (meaning that all outcomes must be consistent with A m < B n ), 

/ (meaning that all outcomes must be consistent with A m > B n ). 

There are nine kinds of adversaries, denoted by A Mp, where A is a left constraint 
and p is a right constraint. For example, a \M\ adversary must say that Ai < Bj 
and Ai < B n \ a .M. adversary is unconstrained. For small values of m and n, 
constrained adversaries of certain kinds are impossible; when to = 1 we obviously 
can’t have a \M / adversary. 

Let us now construct a rather complicated, but very formidable, adversary 
for merging. It does not always produce optimum results, but it gives lower 
bounds that cover a lot of interesting cases. Given to, n, and the left and right 
constraints A and p, suppose the adversary is asked which is the greater of Ai 
or Bj. Six strategies can be used to reduce the problem to cases of smaller m + n : 

Strategy A (k,l), for i < fc < m and 1 < l < j. Say that A t < Bj , and 
require that subsequent operations merge {Ai, . . . , A k ) with {B i, . . . , i} and 
{Ak+ii . . . , A m } with {£?;, . . . , B n }. Thus future comparisons A p : B q will result 
in A p < B q if p < k and q > l\ A p > B q if p > k and q < l] they will be 
handled by a ( fc , l— 1, A, .) adversary if p < fc and q < l; they will be handled by 
an (m— fc, n+l—l, . , p) adversary if p > fc and q > l. 

Strategy B (fc,/), for i < k < m and 1 < l < j- Say that A t < Bj , and 
require that subsequent operations merge {Ai, . . . , A}.} with { Bi , . . . , Bi} and 
{A fc+ ip. . , A m } with {Bi, . . . ,B n }, stipulating that A k < Bi < A k +i- (Note 
that Bi appears in both lists to be merged. The condition A k < Bi < A k + 1 
ensures that merging one group gives no information that could help to merge 
the other.) Thus future comparisons A p :B q will result in A p < B q if p < k and 
q > l ; A p > B q if p k cind q <i; they will be handled by a (fc, l. A, \) adversary 
if p < fc and q < l; by an (to— fc, n+l—l, /, p) adversary if p > fc and q>l. 

Strategy C(k,l), for i < k < m and 1 < l < j. Say that A* < Bj, and 
require that subsequent operations merge {Ai, . . . , A k } with {B \, . . . , and 

{A k , . . . , A m } with {Bi , . . . , B n }, stipulating that < A k < Bi. (Analogous 
to Strategy B, interchanging the roles of A and B.) 

Strategy A '(fc,/), for 1 < fc < i and j < l < n. Say that A t > Bj, and 
require the merging of {Ai, . . . , A k - 1 } with {Bj, . . . , B{\ and {A k , ■ ■ ■ , A m } with 
{B [ +x , . . . , B n }. (Analogous to Strategy A.) 

Strategy B ' (fc, Z) , for 1 < fc < i and j < l < n. Say that A; > Bj, and 
require the merging of {A 1; . . . , A k - 1 } with {B i, . . . , Bi) and {A k , ■ ■ ■ , A m } with 
{B[ B n ), subject to A k ~ i < Bi < A k - (Analogous to Strategy B.) 

Strategy C' (fc, Z) , for 1 < fc < i and j < l < n. Say that A t > Bj, and 
require the merging of {Ai, . . . , A k ) with {B\, . . . ,Bi) and {A k , ■ ■ ■ , A m ) with 
[Bi + 1 , . . . , B n }, subject to Bi < A k < Bi+\. (Analogous to Strategy C.) 


200 SORTING 


5.3.2 


Because of the constraints, the strategies above cannot be used in certain 
cases summarized here: 

Strategy Must be omitted when 

A(M), B(*,1), C(M) A = / 

A'(M), C'(1,0 A = \ 

A(m,l), B(m,Z), C(m,l) P — / 

k !(k,n),B'(k,n),G(k,n) p = \ 

Let A Mp(m,n) denote the maximum lower bound for merging that is ob- 
tainable by an adversary of the class described above. Each strategy, when 
applicable, gives us an inequality relating these nine functions, when the first 
comparison is Ai : Bj , namely, 

A (k, l ): A Mp(m, n) > 1 + A M.(k, l— 1) + .Mp(m—k, n+1— Z); 

B(fc, l ): XMp(m, n) > 1 + A M\(k, l ) + /Mp(m—k, n+1— Z); 

C (k,l): \Mp(m,n) > 1 + XM/(k,l-l) + \Mp(m+l-k,n+l-l); 

A !(k, l ): A Mp(m, n) > 1 + XM.(k—l, Z) + .Mp(m+l—k, n-l)] 

B'(k,l): \Mp(m,n) > 1 + XM\(k—l,l) + /Mp(m+l-k,n+l—l); 

C’(k,l): XMp(m, n) > 1 + A M/(k,l) + \Mp(m+l-k,n-l). 

For fixed i and j, the adversary will adopt a strategy that maximizes the lower 
bound given by all possible right-hand sides, when k and l lie in the ranges 
permitted by i and j. Then we define XMp(m,n) to be the minimum of these 

lower bounds taken over 1 < i < m and 1 < j < n. When rri or n is zero, 

A Mp(m,n) is zero. 

For example, consider the case m = 2 and n = 3, and suppose that our 
adversary is unconstrained. If the first comparison is Ai :B\, the adversary may 
adopt strategy A'(l, 1), requiring ,M.( 0, 1) + ,M.( 2, 2) = 3 further comparisons. 
If the first comparison is Ai'.B^, the adversary may adopt strategy B(l,2), 
requiring .M\(l,2) + /M. (1,2) = 4 further comparisons. No matter what 
comparison Ai :Bj is made first, the adversary can guarantee that at least three 
further comparisons must be made. Hence .M.( 2,3) = 4. 

It isn’t easy to do these calculations by hand, but a computer can grind out 
tables of XMp functions rather quickly. There are obvious symmetries, such as 

/M.(m,n) — . M\(m,n ) = \M.(n,m) = .M/(n,m), ( 7 ) 

by means of which we can reduce the nine functions to just four, 

.M.(m,n), /M.(m,n), /M\(m,n), and /M/(m,n). 

Table 1 shows the resulting values for all m, n < 10; our merging adversary has 
been defined in such a way that 


n) < M(m,n) for all 


m, n > 0 . 


( 8 ) 


5.3.2 


MINIMUM-COMPARISON MERGING 201 


Table 1 

LOWER BOUNDS FOR MERGING, FROM THE “ADVERSARY” 
,M.(m,n ) /M.(m,n) 



1 

2 

3 

4 

5 

6 

7 

8 

9 

10 n 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 


1 

1 

2 

2 

3 

3 

3 

3 

4 

4 

4 

1 

2 

2 

3 

3 

3 

3 

4 

4 

4 

1 

2 

2 

3 

4 

5 

5 

6 

6 

6 

7 

7 

1 

3 

4 

4 

5 

5 

6 

6 

7 

7 

2 

3 

2 

4 

5 

6 

7 

7 

8 

8 

9 

9 

1 

3 

5 

6 

7 

7 

8 

8 

9 

9 

3 

4 

3 

5 

6 

7 

8 

9 

10 

10 

11 

11 

1 

4 

5 

7 

8 

9 

9 

10 

10 

11 

4 

5 

3 

5 

7 

8 

9 

10 

11 

12 

12 

13 

1 

4 

6 

8 

9 

10 

11 

12 

12 

13 

5 

6 

3 

6 

7 

9 

10 

11 

12 

13 

14 

15 

1 

4 

6 

8 

10 

11 

12 

13 

14 

14 

6 

7 

3 

6 

8 

10 

11 

12 

13 

14 

15 

16 

1 

4 

7 

9 

10 

12 

13 

14 

15 

16 

7 

8 

4 

6 

8 

10 

12 

13 

14 

15 

16 

17 

1 

5 

7 

9 

11 

13 

14 

15 

16 

17 

8 

9 

4 

7 

9 

11 

12 

14 

15 

16 

17 

18 

1 

5 

8 

10 

11 

13 

15 

16 

17 

18 

9 

10 

4 

7 

9 

11 

13 

15 

16 

17 

18 

19 

1 

5 

8 

10 

12 

14 

15 

17 

18 

19 

10 

m 





















m 





/M\(m, n) 








/M/(m,n 

) 





1 

—00 

2 

2 

3 

3 

3 

3 

4 

4 

4 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

2 

—00 

2 

4 

4 

5 

5 

6 

6 

7 

7 

1 

3 

3 

4 

4 

4 

4 

5 

5 

5 

2 

3 

—00 

2 

4 

6 

6 

7 

8 

8 

8 

9 

1 

3 

5 

5 

6 

6 

7 

7 

8 

8 

3 

4 

— 00 

2 

5 

6 

8 

8 

9 

10 

10 

11 

1 

4 

5 

7 

7 

8 

9 

9 

9 

10 

4 

5 

—00 

2 

5 

7 

8 

10 

10 

11 

12 

13 

1 

4 

6 

7 

9 

9 

10 

11 

11 

12 

5 

6 

—00 

2 

5 

7 

9 

10 

12 

13 

14 

14 

1 

4 

6 

8 

9 

11 

11 

12 

13 

14 

6 

7 

—00 

2 

5 

8 

10 

11 

12 

14 

15 

16 

1 

4 

7 

9 

10 

11 

13 

14 

15 

15 

7 

8 

—00 

2 

6 

8 

10 

12 

13 

15 

16 

17 

1 

5 

7 

9 

11 

12 

14 

15 

16 

17 

8 

9 

— 00 

2 

6 

9 

10 

12 

14 

16 

17 

18 

1 

5 

8 

9 

11 

13 

15 

16 

17 

18 

9 

10 

—00 

2 

6 

9 

11 

13 

15 

16 

18 

19 

1 

5 

8 

10 

12 

14 

15 

17 

18 

19 

10 


1 

2 

3 

4 

5 

6 

7 

8 

9 

10 n 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 



This relation includes Theorem M as a special case, because our adversary will 
use the simple strategy of that theorem when \m — n\ < 1 . 

Let us now consider some simple relations satisfied by the M function: 


M(m,n) = M(n,m); 


(9) 

M(m,n) < M(m, n+1); 


( 10 ) 

M(k+m, n) < M(k , n) + M(m, n): 


( 11 ) 

M(m,n) < max(M(? 7 i, n— 1) + 1, M(m—l,n) + l), 

for m > 1 , n > 1 ; 

( 12 ) 

M(m,n) < max(M(ra, n— 2) + 1, M(m—l,n) + 2), 

for m > 1 , n > 2 . 

03 ) 


Relation ( 12 ) comes from the usual merging procedure, if we first compare 
Ai '.Bi. Relation ( 13 ) is derived similarly, by first comparing Ai : B 2 \ if -4, > U 2 , 
we need M(m,n—2) more comparisons, but if A\ < B 2 , we can insert ^4i into 
its proper place and merge {A 2 , . . . , A m } with {B 1 , . . . , B n }. Generalizing, we 
can see that if m > 1 and n > k we have 

M(m,n) < max(M(m,n— k) + 1, M(m—l,n) + 1 + [IgA:]), ( 14 ) 

by first comparing Ai : B). and using binary search if Ai < Bk- 

It turns out that M(m, n) = n) for all m, n < 10, so Table 1 actually 

gives the optimum values for merging. This can be proved by using (g)-(i 4 ) 
together with special constructions for (m, n) = (2,8), (3,6), and (5,9) given in 
exercises 8 , 9, and 10. 


202 SORTING 


5.3.2 


On the other hand, our adversary doesn’t always give the best possible 
lower bounds; the simplest example is m = 3, n = 11, when ,M.{ 3, 11) = 9 
but M(3, 11) = 10. To see where the adversary has “failed” in this case, we 
must study the reasons for its decisions. Further scrutiny reveals that if (i,j) / 
(2, 6), the adversary can find a strategy that demands 10 comparisons; but when 
(*, j) = (2,6), no strategy beats Strategy A(2,4), leading to the lower bound 
1 + .M.( 2,3) + ,M.(1,8) - 9. It is necessary but not sufficient to finish by 
merging {A 1 ,A 2 } with {B 1 ,B 2 ,B 3 } and {A 3 } with {B 4 , . . . , B n }, so the lower 
bound fails to be sharp in this case. 

Similarly it can be shown that ,M.( 2,38) = 10 while M( 2,38) = 11, so our 
adversary isn’t even good enough to solve the case m = 2. But there is an infinite 
class of values for which it excels: 

Theorem K. M(m, m+ 2) = 2m + 1, for m > 2; 

M(m, m+ 3) = 2 m + 2, for m > 4; 

M(m, m+4) = 2m. + 3, for m > 6. 

Proof. We can in fact prove the result with M replaced by .M. ; for small m the 

results have been obtained by computer, so we may assume that m is sufficiently 
large. We may also assume that the first comparison is A t :Bj where i < [m/2]. 
If j < i we use strategy A'(i,i), obtaining 

■M. (m, m+d) > 1 + .M.(i-l,i) + .M.(m+l-i,m+d-i) = 2m + d - 1 
by induction on d, for d < 4. If j > i we use strategy A(i,i+1), obtaining 
.M. (m, m+d ) > 1 + .M.(i,i) + ■ M . ( m-i , m+d-i) = 2m + d - 1 
by induction on m. | 

The first two parts of Theorem K were obtained by F. Hwang and S. Lin 
in 1969. Paul Stockmeyer and Frances Yao showed several years later that the 
pattern evident in these three formulas holds in general, namely that the lower 
bounds derived by the adversarial strategies above suffice to establish the values 
M(m, m+d) = 2m + d - 1 for m > 2d - 2. [SICOMP 9 (1980), 85-90.] 

Upper bounds. Now let us consider upper bounds for M(m, n); good upper 
bounds correspond to efficient merging algorithms. 

When m = 1 the merging problem is equivalent to an insertion problem, 
and there are n + 1 places in which A x might fall among For this 

case it is easy to see that any extended binary tree with n+ 1 external nodes is 
the tree for some merging method! (See exercise 2.) Hence we may choose an 
optimum binary tree, realizing the information-theoretic lower bound 

1 + [ignj = M(l,n) = [lg(n + 1)]. (i 5 ) 

Binary search (Section 6.2.1) is, of course, a simple way to attain this value. 

The case m = 2 is extremely interesting, but considerably harder. It has 
been solved completely by R. L. Graham, F. K. Hwang, and S. Lin (see exercises 


5.3.2 


MINIMUM-COMPARISON MERGING 203 


11, 12, and 13), who proved the general formula 

M(2,n) = flg^(n + 1)] + [lg±f(n + 1)]. (16) 

We have seen that the usual merging procedure is optimum when m = n, 
while the rather different binary search procedure is optimum when m — 1. What 
we need is an in-between method that combines the normal merging algorithm 
with binary search in such a way that the best features of both are retained. 
Formula (14) suggests the following algorithm, due to F. K. Hwang and S. Lin 
[SICOMP 1 (1972), 31-39]: 

Algorithm H ( Binary merging). 

HI. [If not done, choose t.] If m or n is zero, stop. Otherwise, if m > n, set 
t «— [lg(m/n)J and go to step H4. Otherwise set t [lg(n/m)J. 

H2. [Compare.] Compare A m :B n+1 _ 2 1 . If A m is smaller, set n <- n - 2* and 
return to step HI. 

H3. [Insert.] Using binary search (which requires exactly t more comparisons), 
insert A m into its proper place among {B n+1 _ 2 t, . . . , B n }. If k is maximal 
such that Hfc < A m , set m <— m — 1 and n 4— k. Return to HI. 

H4. [Compare.] (Steps H4 and H5 are like H2 and H3, interchanging the roles 
of in and n, A and B.) If B n < A m+1 _ 2 t, set m t— m — 2* and return to 
step HI. 

H5. [Insert.] Insert B n into its proper place among the A’ s. If k is maximal 
such that Afc < B n , set to t— k and n <— n — 1. Return to HI. | 

As an example of this algorithm, Table 2 shows the process of merging 
the three keys {087, 503, 512} with thirteen keys {061, 154, . . . , 908}; eight 
comparisons are required in this example. The elements compared at each step 
are shown in boldface type. 


Table 2 

EXAMPLE OF BINARY MERGING 


A 

B 

Output 

087 

503 

512 

061 

154 

170 

275 

426 

509 

612 

653 

677 

703 

765 

897 

908 




087 

503 

512 

061 

154 

170 

275 

426 

509 

612 

653 

677 


_T 


703 

765 

897 

908 

087 

503 

512 

061 

154 

170 

275 

426 

509 

612 





653 

677 703 

765 

897 

908 

087 

503 

512 

061 

154 

170 

275 

426 

509 

612 





653 

677 

703 

765 

897 

908 

087 

503 


061 

154 

170 

275 

426 

509 



512 

612 

653 

677 

703 

765 

897 

908 

087 

503 


061 

154 

170 

275 

426 

509 



512 

612 

653 

677 

703 

765 

897 

908 

087 



061 

154 

170 

275 

426 


503 

509 

512 

612 

653 

677 

703 

765 

897 

908 

087 



061 

Jr 

154 

170 

275 

426 

503 

509 

512 

612 

653 

677 

703 

765 

897 

908 




061 

p87 

154 

170 

275 

426 

503 

509 

512 

612 

653 

677 

703 

765 

897 

908 


Let H(m,n) be the maximum number of comparisons required by Hwang 
and Lin’s algorithm. To calculate H(m,n), we may assume that k — n in step 
H3 and k = m in step H5, since we shall prove that H(m — 1, n) < H(m — 1, n + 1) 


204 SORTING 


5.3.2 


for all n > rn — 1 by induction on to. Thus when m < n we have 

H(m,n) = max(if(TO, n-2*)+l, H(m— 1, n)+f+l), (17) 

for 2 *to <n< 2 t+1 m. Replace n by 2n + e, with e = 0 or 1, to get 

H(m, 2 n+e) = max (H(m, 2n+e-2 t+1 ) + 1, H(m- 1, 2n+e)+f+2) , 
for 2*m < n < 2 t+1 m ; and it follows by induction on n that 


H(m,2n+e) — H(m,n) + m, for to < n and e = 0 or 1. (18) 

It is also easy to see that H(m,n) = m + n - 1 when m < n < 2 to; hence a 
repeated application of (18) yields the general formula 

H(m,n) = to + \n/2 t \ - 1 + tm, for m < n, t=[\g(n/m)\. (19) 

This implies that H(m,n) < H(m, n+1) for all n > m, verifying our inductive 
hypothesis about step H3. 

Setting to = an and 9 — lg (n/m) — t gives 

H(an,n) = an{ 1 + 2 9 - 9 - lga) + 0(1), (20) 

as n -» cxo. We know by Eq. 5.3.1-(36) that 1.9139 < 1 + 2 e - 9 < 2; hence (20) 
may be compared with the information-theoretic lower bound (3). Hwang and 
Lin have proved (see exercise 17) that 


H(m . , n) < 



to + n 

TO. 


+ min (to, n). 


(21) 


The Hwang-Lin binary merging algorithm does not always give optimum 
results, but it has the great virtue that it can be programmed rather easily. 
It reduces to “uncentered binary search” when m = 1, and it reduces to the 
usual merging procedure when m « n, so it represents an excellent compromise 
between those two methods. Furthermore, it is optimum in many cases (see 
exercise 16). Improved algorithms have been found by F. K. Hwang and D. N. 
Deutsch, JACM 20 (1973), 148-159; G. K. Manacher, JACM 26 (1979), 434- 
440; and most notably by C. Christen, FOCS 19 (1978), 259-266. Christen's 
merging procedure, called forward-testing-backward-insertion , saves about m/3 
comparisons over Algorithm H when n/m — >• 00. Moreover, Christen’s procedure 
achieves the lower bound .M. (m,n) = [ (11m + n — 3)/4j when 5 to -3 < n < 
7 to + 2 [to even]; hence it is optimum in such cases (and, remarkably, so is oili- 
adversarial lower bound). 

Formula (18) suggests that the M function itself might satisfy 


M(m,n) < M(m,[n/2\) + m. (22) 

This is actually true (see exercise 19). Tables of M(m,n) suggest several other 
plausible relations, such as 


M(to+1, n) > 1 + M(m,n) > M(to, n+1), for to < n; (23) 

M(m+l, n + l) > 2 + M(m,n); (24) 

but no proof of these inequalities is known. 


5.3.2 


MINIMUM-COMPARISON MERGING 205 


EXERCISES 

1. [15] Find an interesting relation between M(m,n) and the function S defined in 
Section 5.3.1. [Hint: Consider S(m + n).] 

► 2. [22] When m = 1, every merging algorithm without redundant comparisons 
defines an extended binary tree with ( m J] n ) = n + 1 external nodes. Prove that, 
conversely, every extended binary tree with n + 1 external nodes corresponds to some 
merging algorithm with m = 1 . 

3. [M24] Prove that ,M.(l,n) = M(l,n) for all n. 

4. [M42] Is .M. (m,n) > [lg for all m and n? 

5. [M30] Prove that .M. (m, n) < .M\(m, n+1). 

6 . [M26] The stated proof of Theorem K requires that a lot of cases be verified by 
computer. How can the number of such cases be drastically reduced? 

7. [21] Prove (n). 

► 8 . [24] Prove that M( 2,8) < 6 , by finding an algorithm that merges two elements 
with eight others using at most six comparisons. 

9. [21] Prove that three elements can be merged with six in at most seven steps. 

10. [33] Prove that five elements can be merged with nine in at most twelve steps. 
[Hint: Experience with the adversary suggests first comparing A\:B 2 , then trying 
A 5 -.B s if Hr < Bo.] 

11. [M40] (F. Hwang, S. Lin.) Let g 2k = L 2 * g 2k+ 1 = |_2 fc “J, for k > °> so that 
(go,gi,g 2 , ■ ■ ■ ) = (1,1,2,3,4,6,9,13, 19,27, 38,54,77, ...). Prove that it takes more 
than t comparisons to merge two elements with g t elements, in the worst case; but two 
elements can be merged with gt — 1 in at most t steps. [Hint: Show that if n = g t or 
n = gt — 1 and if we want to merge {Ai,A 2 } with {Bi,B 2 , . . . , B n } in t comparisons, 
we can’t do better than to compare A 2 : B gt _ l on the first step.] 

12. [M21] Let R„(i,j ) be the least number of comparisons required to sort the distinct 
objects {a,/3, Xi , . . . , X n }, given the relations 

ot < (3, Xi < X 2 < • ■ • < X n , a < X i+ i, (3 > X n —j. 

(The condition a < X t +i or /3 > X n -j becomes vacuous when i > n or j > n. 
Therefore R n (n,n ) = M(2,n).) 

Clearly, /? n (0,0) = 0. Prove that 

R„(i,j) = 1 +min( min max(f? n (fc— 1 , j), R n -k{i—k, j)), 
l<k<i 

min max(R n (i, fc— 1), R„- k {i, j~k))) 

1 <k<j 

for 0 < i < n, 0 < j < n, i + j > 0 . 

13. [M 42 ] (R. L. Graham.) Show that the solution to the recurrence in exercise 12 
may be expressed as follows. Define the function G(x), for 0 < x < 00 , by the rules 

! 1 , if 0 < x < f; 

2 + jjG( 8 x — 5), if j < x < 5 ; 

|G(2a: — 1), if f < x < 1; 

0 , if 1 < x < 00 . 


206 SORTING 


5.3.2 


(See Fig. 38.) Since R n (i,j) = R n (j,i) and since R n (0,j) = M(l,j), we may assume 

that 1 < i < j < n. Let p = |_lg i\ , q = [lg jj ,r= [lg nj , and let t = n - 2 r + 1. Then 

Rn(i,j) =p + q + S n (i,j) + T n (i,j), 

where S n and T n are functions that are either 0 or 1: 

Sn(i,j) = 1 if and only if q < r or (i - 2 P > u and j - 2 r > u), 

Tn(i,j) = 1 if and only if p < r or (t > f 2 r “ 2 and i- 2 r > v), 

where u = 2 p G(t/2 p ) and v = 2 r_2 G(t/2 r ^ 2 ). 

(This may be the most formidable recurrence relation that will ever be solved!) 

1.0 
0.9 
0.8 
0.7 
0.6 
0.5 
0.4 
0.3 
0.2 
0.1 
0.0 

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 
Fig. 38. Graham’s function (see exercise 13). 

[41] (F. K. Hwang.) Let h 3k = |_§f 2 fc J - 1, h 3k + 1 = h 3k + 3 • 2 k ~ 3 , h 3k+2 = 
LV 2 — f J for k > 3, and let the initial values be defined so that 

(h 0 , hi, h 2 , . . . ) = (1, 1, 2, 2, 3, 4, 5, 7, 9, 11, 14, 18, 23, 29, 38, 48, 60, 76, . . . ) . 

Prove that M(3,ht) > t and M( 3, ht — 1) < t for all t, thereby establishing the exact 
values of M( 3, n) for all n. 

15. [12] Step HI of the binary merge algorithm may require the calculation of the 
expression [lg(n/m)J , for n > m. Explain how to compute this easily without division 
or calculation of a logarithm. 

16. [18] For which m and n is Hwang and Lin’s binary merging algorithm optimum, 
for 1 < m < n < 10? 

17. [ M25 ] Prove (21). [Hint: The inequality isn’t very tight.] 

18. [M 40 ] Study the average number of comparisons used by binary merge. 

► 19. [23] Prove that the M function satisfies (22). 

20. [20] Show that if M(m, n+l) < M(m+l,n) for all m < n, then M(m,n+ 1) < 
1 + M(m, n) for all m < n. 

21 . [M47] Prove or disprove (23) and (24). 



5.3.3 


MINIMUM-COMPARISON SELECTION 207 


22. [M43] Study the minimum average number of comparisons needed to merge m 
things with n. 

23. [M31] (E. Reingold.) Let {Ai,...,A n } and {Bi,...,B n } be sets containing 
n elements each. Consider an algorithm that attempts to test equality of these two 
sets solely by making comparisons for equality between elements. Thus, the algorithm 
asks questions of the form “Is Ai = B 3 ? ” for certain i and j, and it branches depending 
on the answer. 

By defining a suitable adversary, prove that any such algorithm must make at least 
|n(n + 1) comparisons in its worst case. 

24. [22] (E. L. Lawler.) What is the maximum number of comparisons needed by the 
following algorithm for merging m elements with n > m elements? “Set t <— [lg(n/m)J 
and use Algorithm 5.2.4M to merge Ai, A 2 , . . . , A m with B 2 t, B 2 . 2 t , . . . , B q . 2 t, where 
q = L«/2*J. Then insert each Aj into its proper place among the B*,.” 

► 25. [25] Suppose (x t] ) is an m x n matrix with nondecreasing rows and columns: 
Xij < X(i+ i)j for 1 < i < m and Xij < Xi( J+1 ) for 1 < j < n. Show that M(m,n) is 
the minimum number of comparisons needed to determine whether a given number x 
is present in the matrix, if all comparisons are between x and some matrix element. 

*5.3.3. Minimum-Comparison Selection 

A similar class of interesting problems arises when we look for best possible 
procedures to select the t th largest of n elements. 

The history of this question goes back to Rev. C. L. Dodgson’s amusing 
(though serious) essay on lawn tennis tournaments, which appeared in St. James's 
Gazette, August 1, 1883, pages 5-6. Dodgson — who is of course better known 
as Lewis Carroll— ‘was concerned about the unjust manner in which prizes were 
awarded in tennis tournaments. Consider, for example, Fig. 39, which shows 
a typical “knockout tournament” between 32 players labeled 01, 02, . . . , 32. 
In the finals, player 01 defeats player 05, so it is clear that player 01 is the 
champion and deserves the first prize. The inequity arises because player 05 
usually gets second prize, although someone else might well be the second best. 
You can win second prize even if you are worse than half of the players in the 
competition! In fact, as Dodgson observed, the second-best player wins second 
prize if and only if the champion and the next-best are originally in opposite 
halves of the tournament; this occurs with probability 2 n-1 /(2" — 1), when there 
are 2 n competitors, so the wrong player receives second prize almost half of the 
time. If the losers of the semifinal round (players 25 and 1 7 in Fig. 39) compete 
for third prize, it is highly unlikely that the third-best player receives third prize. 

Dodgson therefore set out to design a tournament that determines the true 
second- and third-best players, assuming a transitive ranking. (In other words, if 
player A beats player B and B beats C, Dodgson assumed that A would beat C .) 
He devised a procedure in which losers are allowed to play further games until 
they are known to be definitely inferior to three other players. An example of 
Dodgson’s scheme appears in Fig. 40, which is a supplementary tournament to 
be run in conjunction with Fig. 39. He tried to pair off players whose records in 
previous rounds were equivalent; he also tried to avoid matches in which both 


208 SORTING 


5.3.3 


players had been defeated by the same person. In this particular example, 16 
loses to 11 and 13 loses to 12 in Round 1; after 13 beats 16 in the second 
round, we can eliminate 16 , who is now known to be inferior to 11, 12, and 13. 
In Round 3 Dodgson did not allow 19 to play with 21, since they have both 
been defeated by 18 and we could not automatically eliminate the loser of 19 
versus 21. 


Round 5 (Finals) 


Round 4 
Round 3 


01 


01 


01 

I 


02 


Champion = 01 


25 
I 


05 


25 


29 

i 


05 

_L 


I 

11 

_L 


— I 
05 
I 


17 

I 


17 


18 


Round 2 01 03 02 04 25 26 29 30 05 06 fl I2 17 20 18 21 

rhrhrhi^rhrhrhrhrhrhrhrhrhrhr-hrh 

Round 1 01 0703 10 02 08 04 09 25 28 26 27 29 32 30 31 05 15 06 14 1 1 16 12 13 17 24 20 23 18 19 21 22 
Fig. 39. A knockout tournament with 32 players. 


It would be nice to report that Lewis Carroll’s tournament turns out to be 
optimal, but unfortunately that is not the case. His diary entry for July 23. 
1883, says that he composed the essay in about six hours, and he felt “we are 
now so late in the [tennis] season that it is better it should appear soon than be 
written well.” His procedure makes more comparisons than necessary, and it is 
not formulated precisely enough to qualify as an algorithm. On the other hand, it 
has some rather interesting aspects from the standpoint of parallel computation. 
And it appears to be an excellent plan for a tennis tournament, because he 
built in some dramatic effects; for example, he specified that the two finalists 
should sit out round 5, playing an extended match during rounds 6 and 7. But 
tournament directors presumably thought the proposal was too logical, and so 
Carroll’s system has apparently never been tried. Instead, a method of “seeding” 
is used to keep the supposedly best players in different parts of the tree. 


Round 9 


Third prize = 03 


r~ 

03 


1 


05 


Second prize = 02 


Round 8 02 


Round 7 




02 

1 



Round 6 

r 
02 
, L_ 




1 

06 

1 


Round 5 

02 




1 

06 

1 

07 


rh 



1 — 

1 

1 

Hn 

Round 4 

02 20 

12 

06 

23 

07 29 


rh 

rh 

rh 

1 — 1 — 1 


Round 3 

20 21 

12 19 

06 27 

23 31 

07 08 



1 

-h 

rh 

rh rh 

rh rh 

Round 2 


19 22 

27 28 

23 24 31 32 

07 10 08 09 


I 

03 

I ' 1 

03 11 

1 1 1 Hi 
03 26 11 18 

rh rh 
03 04 26 30 


“1 

05 


1 

03 


17 

rh 
17 25 


13 

r 1 1 

13 14 

rh rh 
13 16 14 15 


Fig. 40. Lewis Carroll’s lawn tennis tournament (played in conjunction with Fig. 39). 


5.3.3 


MINIMUM-COMPARISON SELECTION 209 


In a mathematical seminar during 1929-1930, Hugo Steinhaus posed the 
problem of finding the minimum , number of tennis matches required to determine 
the first and second best players in a tournament, when there are n > 2 players 
in all. J. Schreier [Mathesis Polska 7 (1932), 154-160] gave a procedure that 
requires at most n — 2 + |~lg n] matches, using essentially the same method as the 
first two stages in what we have called tree selection sorting (see Section 5.2.3, 
Fig. 23), avoiding redundant comparisons that involve — oo. Schreier also claimed 
that n — 2 + [lgn] is best possible, but his proof was incorrect, as was another 
attempted proof by J. Slupecki [ Colloquium Mathematician 2 (1951), 286-290]. 
Thirty-two years went by before a correct, although rather complicated, proof 
was finally published by S. S. Kislitsyn [Sibirskii Mat. Zhurnal 5 (1964), 557 564]. 

Let V t (n) denote the minimum number of comparisons needed to determine 
the fth largest of n elements, for 1 < t < n, and let W t (n ) be the minimum 
number required to determine the largest, second largest, . . . , and the ft.h largest, 
collectively. By symmetry, we have 

V t (n) = V n+1 _ t (n), (i) 

and it is obvious that 

V\{n) = W\(n), (2) 

V t (n) < W t (n), ( 3 ) 

W n (n) = W n -i(n) — S(n). ( 4 ) 

We have observed in Lemma 5.2.3M that 

Vi(n) = n-1. (5) 

In fact, there is an astonishingly simple proof of this fact, since everyone in a 
tournament except the champion must lose at least one game! By extending this 
idea and using an “adversary” as in Section 5.3.2, we can prove the Schreier- 
Kislitsyn theorem without much difficulty: 

Theorem S. V _ 2 ( n ) = W^in) = n — 2 + [lgn], for n > 2. 

Proof. Assume that n players have participated in a tournament that has 
determined the second-best player by some given procedure, and let (ij be the 
number of players who have lost j or more matches. The total number of matches 
played is then cq + a 2 + <13 + • • • . We cannot determine the second-best player 
without also determining the champion (see exercise 2), so our previous argument 
shows that cq = n — 1. To complete the proof, we will show that there is always 
some sequence of outcomes of the matches that makes a 2 > [lgn] — 1. 

Suppose that at the end of the tournament the champion has played (and 
beaten) p players; one of these is the second best, and the others must have lost 
at least one other time, so a 2 > p — 1. Therefore we can complete the proof by 
constructing an adversary who decides the results of the games in such a way 
that the champion must play at least |"lgn] other people. 

Let the adversary declare A to be better than B if A is previously undefeated 
and B has lost at least once, or if both are undefeated and B has won fewer 


210 SORTING 


5.3.3 


matches than A at that time. In other circumstances the adversary may make 
an arbitrary decision consistent with some partial ordering. 

Consider the outcome of a complete tournament whose matches have been 
decided by such an adversary. Let us say that “A supersedes B v if and only if A = 
B or A supersedes the player who first defeated B. (Only a player’s first defeat 
is relevant in this relation; a loser’s subsequent games are ignored. According 
to the mechanism of the adversary, any player who first defeats another must 
be previously unbeaten.) It follows that a player who won the first p matches 
supersedes at most 2 l> players on the basis of those p contests. (This is clear 
for p — 0, and for p > 0 the pth match was against someone who was either 
previously beaten or who supersedes at most 2 P ~ 1 players.) Hence the champion, 
who supersedes everyone, must have played at least [lg n] matches. | 

Theorem S completely resolves the problem of finding the second-best player, 
m the minimax sense. Exercise 6 shows, in fact, that it is possible to give a simple 
formula for the minimum number of comparisons needed to find the second 
largest element of a set when an arbitrary partial ordering of the elements is 
known beforehand. 

What if t > 2? In the paper cited above, Kislitsyn went on to consider larger 
values of t, proving that 

W t (n) < n — t. + £ flgjl, for n > t. (6) 

n+1 — t<j<n 

For t — 1 and t— 2 we have seen that equality actually holds in this formula; 
for t = 3 it can be slightly improved (see exercise 21 ). 

We shall prove Kislitsyn’s theorem by showing that the first t stages of tree 
selection require at most n — t + ^ 2 n +i-t<j< n Rsil comparisons, ignoring all of 
the comparisons that involve -oo. It is interesting to note that, by Eq. 5 . 3.1 - (3), 
the right-hand side of (6) equals B(n) when t — n, and also when t = n — 1; 
hence tree selection and binary insertion yield the same upper bound for the 
sorting problem, although they are quite different methods. 

Let a be an extended binary tree with n external nodes, and let tv be a 
permutation of { 1 , 2 ,... ,/),}. Place the elements of tv into the external nodes, 
from left to right in symmetric order, and fill in the internal nodes according to 
the rules of a knockout tournament as in tree selection. When the resulting tree is 
subjected to repeated selection operations, it defines a sequence c„_ 1 c n _ 2 . . . ci, 
where Cj is the number of comparisons required to bring element j to the root 
of the tree when element j + 1 has been replaced by —00. For example, if a is 
the tree 


( 7 ) 


5.3.3 


MINIMUM-COMPARISON SELECTION 211 



If 7r had been 3 1 5 4 2, the sequence c 4 c 3 c 2 Ci would have been 2110 instead. 
It is not difficult to see that C\ is always zero. 

Let n(a, n) be the multiset {c n _j, c n _ 2 , . • . , c x ] determined by a and n. If 



and if elements 1 and 2 do not both appear in a' or both in a ", it is easy to see 
that 

p(a, tv) = (n(a', -k') + l) l±l (p(a", tv") + 1) l±) {0} (8) 

for appropriate permutations tv' and n", where p+1 denotes the multiset obtained 
by adding 1 to each element of /i. (See exercise 7.) On the other hand, if elements 
1 and 2 both appear in a', we have 

H{a, tv) = (n(a', tv') + e) tti (, n(a ", tv") + l) l±) {0}, 

where // + e denotes a multiset obtained by adding 1 to some elements of )i and 
0 to the others. A similar formula holds when 1 and 2 both appear in a". Let us 
say that multiset // 1 dominates // 2 if both / / 1 and /i 2 contain the same number 
of elements, and if the A:th largest element of /p is greater than or equal to the 
fcth largest element of /t 2 for all k; and let us define fi(a) to be the dominant 
P(q,7t), taken over all permutations n, in the sense that p(a) dominates tv) 
for all tv and fi(a) = /r(a, tv) for some tv. The formulas above show that 


/*(□)= 0 ( 



(9) 


hence p(a) is the multiset of all distances from the root to the internal nodes of a. 

The reader who has followed this train of thought will now see that we are 
ready to prove Kislitsyn’s theorem (6). Indeed, Wt(n) is less than or equal to 
n — 1 plus the t — 1 largest elements of n( a )i where a is any tree being used 
in tree selection sorting. We may take a to be the complete binary tree with 
n external nodes (see Section 2. 3. 4. 5), when 


/*(«) = {LigiJ>Lig2J,...,Li g (n-i)j} 

= {fig 2 ! -1, Tig 3] -1, . . . , pgn] -l}. ( 10 ) 

Formula (6) follows when we consider the t — 1 largest elements of this multiset. 


212 SORTING 


5.3.3 



Kislitsyn’s theorem gives a good upper bound for W t (n); he remarked that 
^3(5) = 6 < W 3 (5) = 7, but he was unable to find a better bound for V t (n) than 
for W t (n). A. Hadian and M. Sobel discovered a way to do this using replacement 
selection instead of tree selection; their formula [Univ. of Minnesota. Dept, of 
Statistics Report 121 (1969)], 

V t (n) < n — t + (t - l)[lg(ra + 2 - t)], n>t , (11) 

is similar to Kislitsyn’s upper bound for W t (n) in (6), except that each term in 
the sum has been replaced by the smallest term. 

Hadian and Sobel’s theorem (11) can be proved by using the following 
construction: First set up a binary tree for a knockout tournament on n — t + 2 
items. (This takes n — t + 1 comparisons.) The largest item is greater than 
n — t + 1 others, so it can’t be fth largest. Replace it, where it appears at an 
external node of the tree, by one of the t — 2 elements held in reserve, and find 
the largest element of the resulting n - 1 + 2; this requires at most [ lg(n + 2 - t)l 
comparisons, because we need to recompute only one path in the tree. Repeat 
this operation t — 2 times in all, for each element held in reserve. Finally, replace 
the currently largest element by -00, and determine the largest of the remaining 
n + 1 — t; this requires at most |"lg(n + 2 — t)] — 1 comparisons, and it brings 
the fth largest element of the original set to the root of the tree. Summing the 
comparisons yields (11). 

In relation (11) we should of course replace t by n + 1 - 1 on the right-hand 
side whenever n+l-t gives a better value (as when n = 6 and t = 3). Curiously, 
the formula gives a smaller bound for 14(13) than it does for 14(13). The upper 
bound in (11) is exact for n < 6, but as n and t get larger it is possible to obtain 
much better estimates of V t (n). 

For example, the following elegant method (due to David G. Doren) can be 
used to show that K 4 (8) < 12. Let the elements be Ai,...,A 8 ; first compare 
X \ : A 2 and A 3 : A 4 and the two winners, and do the same to X 5 :X 6 and X 7 :X S 
and their winners. Relabel elements so that X\ < A 2 < A 4 > X3, X 5 < X 6 < 
X 8 > A 7, then compare A 2 :A 6 ; by symmetry assume that A 2 < A 6 , so that we 
have the configuration 


5 6 



(Now Ai and Ag are out of contention and we must find the third largest of 
{A 2 , . . . , A 7 }.) Compare A 2 : A 7 , and discard the smaller; in the worst case we 
have A 2 < X 7 and we must find the third largest of 



This can be done in V3 (5) — 2 = 4 more steps, since the procedure of (11) that 
achieves 14(5) = 6 begins by comparing two disjoint pairs of elements. 


5.3.3 


MINIMUM-COMPARISON SELECTION 213 


Table 1 

VALUES OF V t (n) FOR SMALL n 


n 

Ki(n) 

V 2 (n) 

V s (n) 

V 4 (n) 

V 5 (n) 

Ve (n) 

V 7 (n) V s {n) Vg (n) Vio(n) 

1 

0 







2 

1 

1 






3 

2 

3 

2 





4 

3 

4 

4 

3 




5 

4 

6 

6 

6 

4 



6 

5 

7 

8 

8 

7 

5 


7 

6 

8 

10 

10* 

10 

8 

6 

8 

7 

9 

11 

12 

12 

11 

9 7 

9 

8 

11 

12 

14 

14* 

14 

12 11 8 

10 

9 

12 

14* 

15 

16** 

16 *. 

15 14* 12 9 

* Exercises 10 

12 give constructions that 

improve on Eq. ( 11 ) in these cases. 


'* See K. Noshita, Trans, of the IECE of Japan E59, 12 (December 1976), 17-18. 

Other tricks of this kind can be used to produce the results shown in Table 1; 
no general method is evident as yet. The values listed for V 4 (9) = 1/6 (9) and 
Vs(10) = V 6 (10) were proved optimum in 1996 by W. Gasarch, W. Kelly, and 
W. Pugh [SIGAC'T News 27,2 (June 1996), 88-96], using a computer search. 

A fairly good lower bound for the selection problem when t is small was 
obtained by David G. Kirkpatrick [ JACM 28 (1981), 150-165]: If 2 < t < 
(n + l)/2, we have 


t-2 

V t (n) >n + t- 3 + y^ 

3= 0 

In his Ph.D. thesis [U. of Toronto, 1974], Kirkpatrick also proved that 


lg 


n — t + 2 

t + j 


V 3 (n) < n + 1 + 


lg 


n — 1 


+ 


(12) 


<T3) 


this upper bound matches the lower bound (12) for lg § « 74% of all integers n, 
and it exceeds (12) by at most 1. Kirkpatrick’s analysis made it natural to 
conjecture that equality holds in (13) for all n > 4, but Jutta Eusterbrock found 
the surprising counterexample V 3 (22) = 28 [Discrete Applied Math. 41 (1993), 
131-137], Improved lower bounds for larger values of t were found by S. W. Bent 
and J. W. John (see exercise 27): 


V t {n) > n + m-2\\fm), m = 2 + |" lg j (n + 1 - t)j . (14) 

This formula proves in particular that 

V a „{n) > (l + alg^ + (1 - a)lg + 0(^/n). (15) 


214 SORTING 


5.3.3 


A linear method. When n is odd and t = fn/2], the tth largest (and tth 
smallest) element is called the median. According to (n), we can find the 
median of n elements in w |nlgn comparisons; but this is only about twice as 
fast as sorting, even though we are asking for much less information. For several 
years, concerted efforts were made by a number of people to find an improvement 
over ( 11 ) when / and n are large. Finally in 1971, Manuel Blum discovered a 
method that needed only 0(n log log n) steps. Blum’s approach to the problem 
suggested a new class of techniques, which led to the following construction due 
to R. Rivest and R. Tarjan [J. Comp, and Sys. Sci. 7 (1973), 448-461]: 

Theorem L. If n > 32 and 1 < t < n, we have V t {n) < 15 n - 163. 

Proof. The theorem is trivial when n is small, since V t (n ) < S(n) < 10n < 
15n — 163 for 32 < n < 2 10 . By adding at most 13 dummy — oo elements, we 
may assume that n = 7(2 q + 1) for some integer q > 73. The following method 
may now be used to select the tth largest: 

Step 1. Divide the elements into 2q + 1 groups of seven elements each, and sort 
each of the groups. This takes at most 13(2 q +1) comparisons. 

Step 2. Find the median of the 2q + 1 median elements obtained in Step 1. 
and call it x. By induction on q , this takes at most V q+1 (2 q + 1) < 30g - 148 
comparisons. 

Step 3. The n - 1 elements other than x have now been partitioned into three 
sets (see Fig. 41): 

4<? + 3 elements known to be greater than x (Region B); 

4g + 3 elements known to be less than x (Region C); 

6 q elements whose relation to x is unknown (Regions A and D). 

By making 4 q additional comparisons, we can tell exactly which of the elements 
in regions A and D are less than x. (We first test x against the middle element 
of each triple.) 

Step 4. We have now found r elements greater than x and n — 1 — r elements 
less than x , for some r. If t = r + 1, x is the answer; if t < r + 1, we need 
to find the tth largest of the r large elements; and if t > r + 1, we need to 
find the (t-l-r)th largest of the n - 1 - r small elements. The point is that 
r and n — 1 r are both less than or equal to 10 q + 3 (the size of regions A 
and D, plus either B or C). By induction on q this step therefore requires at 
most 15(10<? + 3) — 163 comparisons. 

The total number of comparisons comes to at most 

13(2g + 1) + 30g - 148 + 4q + 15(10g + 3) - 163 = 15(14g - 6) - 163. 

Since we started with at least 14g — 6 elements, the proof is complete. | 

Theorem L shows that selection can always be done in linear time, namely 
that V t (n) = 0(n). Of course, the method used in this proof is rather crude, 
since it throws away good information in Step 4. Deeper study of the problem 


5.3.3 


MINIMUM-COMPARISON SELECTION 215 


Region A Region B 



Fig. 41. The selection algorithm of Rivest and Tarjan (q = 4). 

has led to much sharper bounds; for example, A. Schonhage, M. Paterson, and 
N. Pippenger [J. Comp. Sys. Sci. 13 (1976), 184-199] proved that the maximum 
number of comparisons required to find the median is at most 3n + 0(nlogn) 3 / 4 . 
See exercise 23 for a lower bound and for references to more recent results. 

The average number. Instead of minimizing the maximum number of compar- 
isons, we can ask instead for an algorithm that minimizes the average number 
of comparisons, assuming random order. As usual, the minimean problem is 
considerably harder than the minimax problem; indeed, the minimean problem 
is still unsolved even in the case t — 2. Claude Picard mentioned the problem in 
his book Theorie des Questionnaires (1965), and an extensive exploration was 
undertaken by Milton Sobel [Univ. of Minnesota, Dept, of Statistics Reports 
113 and 114 (November 1968); Revue Frangaise d’Automatique, Informatique et 
Recherche Operationnelle 6,R-3 (December 1972), 23-68], 

Sobel constructed the procedure of Fig. 42, which finds the second largest 
of six elements using only 6| comparisons on the average. In the worst case, 
8 comparisons are required, and this is worse than 1^(6) = 7; in fact, an 
exhaustive computer search by D. Hoey has shown that the best procedure for 
this problem, if restricted to at most 7 comparisons, uses 6|| comparisons on 
the average. Thus no procedure that finds the second largest of six elements can 
be optimum in both the minimax and the minimean senses simultaneously. 

Let V t (n) denote the minimum average number of comparisons needed to 
find the <th largest of n elements. Table 2 shows the exact values for small n, as 
computed by D. Hoey. 

R. W. Floyd discovered in 1970 that the median of n elements can be found 
with only + 0(n 2 / 3 logn) comparisons, on the average. He and R. L. Rivest 
refined this method a few years later and constructed an elegant algorithm to 
prove that _ 

V t {n) < n + min(f, n—t) + 0(%/nlogn ). 

(See exercises 13 and 24.) 


(16) 


216 SORTING 


5.3.3 



Fig. 42. A procedure that selects the second largest of { AA . AA . A 3 , AA . A 5 . AA } . using 
6 \ comparisons on the average. Each “symmetrical” branch is identical to its sibling, 
with names permuted in some appropriate maimer. External nodes contain “j jfc” when 
Xj is known to be the second largest and Xk the largest; the number of permutations 
leading to such a node appears immediately below it. 

Using another approach, based on a generalization of one of Sobel’s construc- 
tions for t = 2, David W. Matula [Washington Univ. Tech. Report AMCS-73-9 
(1973)] showed that 

Vt(n) < n + t [lgt](ll + lnlnn). ( 17 ) 

Thus, for fixed t the average amount of work can be reduced to n + O(loglogn) 
comparisons. An elegant lower bound on V t {n) appears in exercise 25. 

The sorting and selection problems are special cases of the much more 
general problem of finding a permutation of n given elements that is consistent 
with a given partial ordering. A. C. Yao [SICOMP 18 (1989), 679-689] has 
shown that, if the partial ordering is defined by an acyclic digraph G on n 
vertices with k connected components, the minimum number of comparisons 
necessary to solve such problems is always 0(lg(n!/T(G)) +n — k), in both the 
worst case and on the average, where T(G ) is the total number of permutations 
consistent with the partial ordering (the number of topological sortings of G). 

EXERCISES 

1. [15] In Lewis Carroll’s tournament (Figs. 39 and 40), why was player 13 elimi- 
nated in spite of winning in Round 3? 


5.3.3 


MINIMUM-COMPARISON SELECTION 217 


Table 2 

MINIMUM AVERAGE COMPARISONS FOR SELECTION 


n 

Vi(n) 

V 2 (n) 

V 3 (n) 

V 4 (n) 

V f (n) 

Ve(n) 

V 7 (n) 

1 

0 







2 

1 

1 






3 

2 

2- 

Z 3 

2 





4 

3 

4 

4 

3 




5 

6 

7 

4 

5 

6 

5 I7 

H 

7 149 
' 210 

513 

D 15 

7— 

1 18 

0 509 
°630 

5 — 

D 15 

7— 

‘ 18 

Q_32_ 

^ 105 

4 

0 509 
°630 

5 

7 149 

1 210 

6 


► 2. [M25] Prove that after we have found the tth largest of n elements by a sequence 
of comparisons, we also know which t — 1 elements are greater than it, and which n — t 
elements are less than it. 

3. [20] Prove that V t (n) > V t (n - 1) and W t (n) > W t {n - 1), for 1 < t < n. 

► 4. [M25] (F. Fussenegger and H. N. Gabow.) Prove that W t (n) >n — t+ [lgn— ]. 
5. [10] Prove that W 3 (n) < V 3 (n) + 1. 

► 6. [M26] (R. W. Floyd.) Given n distinct elements {X\, . . . , X n } and a set of 

relations X t < Xj for certain pairs we wish to find the second largest element. 

If we know that Xi < Xj and X, < Xk for j / k, Xj cannot possibly be the second 
largest, so it can be eliminated. The resulting relations now have a form such as 

• — >>>- 

namely, m groups of elements that can be represented by a multiset {h ,h, . . . , km }', the 
jth group contains lj + 1 elements, one of which is known to be greater than the others. 
For example, the configuration above can be described by the multiset {0, 1, 2, 2, 3, 5}; 
when no relations are known we have a multiset of n zeros. 

Let f(li,l 2 ,... ,l m ) be the minimum number of comparisons needed to find the 
second largest element of such a partially ordered set. Prove that 

2 + f lg(2 1 + 2 2 + • ■ • -f- 2^™ )~| . 

[Hint: Show that the best strategy is always to compare the largest elements of the two 
smallest groups, until reducing m to unity; use induction on h + h + • ■ • + l m + 2m.] 

7. [M20] Prove (8). 

8. [ M21 ] Kislitsyn’s formula (6) is based on tree selection sorting using the complete 
binary tree with n external nodes. Would a tree selection method based on some other 
tree give a better bound, for any t and n? 

► 9. [20] Draw a comparison tree that finds the median of five elements in at most six 
steps, using the replacement-selection method of Hadian and Sobel [see ( 11 )]. 

10. [35] Show that the median of seven elements can be found in at most 10 steps. 



218 SORTING 


5.3.3 


11. [38] (K. Noshita.) Show that the median of nine elements can be found in at 
most 14 steps, of which the first seven are identical to Doren’s method. 

12. [21] (Hadian and Sobel.) Prove that V 3 {n) < V 3 (n - 1) + 2. [Hint: Start by 
discarding the smallest of {X 1 , X 2 , X 3 , X 4 }.] 

► 13. [HM28] (R. W. Floyd.) Show that if we start by finding the median element of 
{.Yi, . . . , X n - 2 /3 }, using a recursively defined method, we can go on to find the median 
of {Ai, . . . ,X n } with an average of | n + 0(n 2 ^ 3 logn) comparisons. 

► 14. [20] (M. Sobel.) Let U t (n) be the minimum number of comparisons needed to 
find the t largest of n elements, without necessarily knowing their relative order. Show 
that 1 / 2 ( 5 ) < 5. 

15. [22] (I. Pohl.) Suppose that we are interested in minimizing space instead of time. 
What is the minimum number of data words needed in memory in order to compute 
the fth largest of n elements, if each element fills one word and if the elements are 
input one at a time into a single register? 

► 16. [25] (I. Pohl.) Show that we can find both the maximum and the minimum of a 
set of n elements, using at most |"§n] - 2 comparisons; and the latter number cannot 
be lowered. [Hint: Any stage in such an algorithm can be represented as a quadruple 
( a,b,c,d ), where a elements have never been compared, b have won but never lost, 
c have lost but never won, d have both won and lost. Construct an adversary.] 

17. [20] (R. W. Floyd.) Show that it is possible to select, in order, both the k largest 
and the l smallest elements of a set of n elements, using at most |"|n] — k — l + 
En+l —k<j<n I! + En+l-|<)<n fig j] comparisons. 

18. [M20] If groups of size 5, not 7, had been used in the proof of Theorem L, what 
theorem would have been obtained? 

19. [M 42 ] Extend Table 2 to n = 8. 

20. [M47] What is the asymptotic value of V 2 (n) — n, as n — > 00 ? 

21. [32] (P. V. Ramanan and L. Hyafil.) Prove that W_t(2^k + 2^{k+1−t}) ≤ 2^k + 2^{k+1−t} + (k+1)(t−1) when k ≥ t ≥ 2; also show that equality holds for infinitely many k and t, because of exercise 4. [Hint: Maintain two knockout trees and merge their results cleverly.]

22. [24] (David G. Kirkpatrick.) Show that when 4·2^k < n − 1 < 5·2^k, the upper bound (11) for V_3(n) can be reduced by 1 as follows: (i) Form four knockout trees of size 2^k. (ii) Find the minimum of the four maxima, and discard all 2^k elements of its tree. (iii) Using the known information, build a single knockout tree of size n − 1 − 2^k. (iv) Continue as in the proof of (11).

23. [M49] What is the asymptotic value of U_{⌈n/2⌉}(n), as n → ∞?

► 24. [HM40] Prove that V_t(n) ≤ n + t + O(√(n log n)) for t ≤ ⌈n/2⌉. [Hint: Show that with this many comparisons we can in fact find both the ⌊t − √(t ln n)⌋th and ⌈t + √(t ln n)⌉th elements, after which the tth is easily located.]

► 25. [M35] (W. Cunto and J. I. Munro.) Prove that V_t(n) ≥ n + t − 2 when t ≤ ⌈n/2⌉.
26. [M32] (A. Schönhage, 1974.) (a) In the notation of exercise 14, prove that U_t(n) ≥ min(2 + U_t(n−1), 2 + U_{t−1}(n−1)) for n ≥ 3. [Hint: Construct an adversary by reducing from n to n − 1 as soon as the current partial ordering is not composed entirely of one-element and two-element components.] (b) Similarly, prove that

U_t(n) ≥ min(2 + U_t(n−1), 3 + U_{t−1}(n−1), 3 + U_t(n−2))

for n ≥ 5, by constructing an adversary that also deals with the three-element components that can arise (the diagrams of these small components are not reproduced here). (c) Therefore we have U_t(n) ≥ n + t + min(⌊(n−t)/2⌋, t) − 3 for 1 ≤ t ≤ n/2. [The inequalities in (a) and (b) apply also when V or W replaces U, thereby establishing the optimality of several entries in Table 1.]

► 27. [M34] A randomized adversary is an adversary algorithm that is allowed to flip coins as it makes decisions.

a) Let A be a randomized adversary and let Pr(l) be the probability that A reaches leaf l of a given comparison tree. Show that if Pr(l) ≤ p for all l, the height of the comparison tree is ≥ lg(1/p).

b) Consider the following adversary for the problem of selecting the tth largest of n elements, given integer parameters q and r to be selected later:

A1. Choose a random set T of t elements; all \binom{n}{t} possibilities are equally likely. (We will ensure that the t − 1 largest elements belong to T.) Let S = {1, …, n} \ T be the other elements, and set S_0 ← S, T_0 ← T; S_0 and T_0 will represent elements that might become the tth largest.

A2. While |T_0| > r, decide all comparisons x:y as follows: If x ∈ S and y ∈ T, say that x < y. If x ∈ S and y ∈ S, flip a coin to decide, and remove the smaller element from S_0 if it was in S_0. If x ∈ T and y ∈ T, flip a coin to decide, and remove the larger element from T_0 if it was in T_0.

A3. As soon as |T_0| = r, partition the elements into three classes P, Q, R as follows: If |S_0| ≤ q, let P = S, Q = T_0, R = T \ T_0. Otherwise, for each y ∈ T_0, let C(y) be the elements of S already compared with y, and choose y_0 so that |C(y_0)| is minimum. Let P = (S \ S_0) ∪ C(y_0), Q = (S_0 \ C(y_0)) ∪ {y_0}, R = T \ {y_0}. Decide all future comparisons x:y by saying that elements of P are less than elements of Q, and elements of Q are less than elements of R; flip a coin when x and y are in the same class. ∎

Prove that if 1 ≤ r ≤ t and if |C(y_0)| ≤ q − r at the beginning of step A3, each leaf is reached with probability ≤ (n + 1 − t)/(2^{n−q} \binom{n}{t}). [Hint: Show that at least n − q coin flips are made.]

c) Continuing (b), show that we have

V_t(n) ≥ min(n − 1 + (r − 1)(q + 1 − r), n − q + lg(\binom{n}{t}/(n + 1 − t))),

for all integers q and r.

d) Establish (14) by choosing q and r.

*5.3.4. Networks for Sorting 

In this section we shall study a constrained type of sorting that is particularly 
interesting because of its applications and its rich underlying theory. The new 
constraint is to insist on an oblivious sequence of comparisons, in the sense that
whenever we compare K_i versus K_j the subsequent comparisons for the case
K_i < K_j are exactly the same as for the case K_i > K_j, but with i and j
interchanged.

Figure 43(a) shows a comparison tree in which this homogeneity condition is
satisfied. Notice that every level has the same number of comparisons, so there
are 2^m outcomes after m comparisons have been made. But n! is not a power
of 2; some of the comparisons must therefore be redundant, in the sense that



Fig. 43. (a) An oblivious comparison tree; (b) the corresponding network.





one of their subtrees can never arise in practice. In other words, some branches 
of the tree must make more comparisons than necessary, in order to ensure that 
all of the corresponding branches of the tree will sort properly. 

Since each path from top to bottom of such a tree determines the entire tree, such a sorting scheme is most easily represented as a network; see Fig. 43(b). The boxes in such a network represent "comparator modules" that have two inputs (represented as lines coming into the module from above) and two outputs (represented as lines leading downward); the left-hand output is the smaller of the two inputs, and the right-hand output is the larger. At the bottom of the network, K'_1 is the smallest of {K_1, K_2, K_3, K_4}, K'_2 the second smallest, etc. It is not difficult to prove that any sorting network corresponds to an oblivious comparison tree in the sense above, and any oblivious tree corresponds to a network of comparator modules.

Incidentally, we may note that comparator modules are fairly easy to manufacture, from an engineering point of view. For example, assume that the lines
contain binary numbers, where one bit enters each module per unit time, most 
significant bit first. Each comparator module has three states, and behaves as 
follows: 


    Time t                Time (t + 1)
    State    Inputs       State    Outputs
      0       0 0           0       0 0
      0       0 1           1       0 1
      0       1 0           2       0 1
      0       1 1           0       1 1
      1       x y           1       x y
      2       x y           2       y x


Initially all modules are in state 0 and are outputting 0 0. A module enters either state 1 or state 2 as soon as its inputs differ. Numbers that begin to be transmitted at the top of Fig. 43(b) at time t will begin to be output at the bottom, in sorted order, at time t + 3, if a suitable delay element is attached to the K'_1 and K'_4 lines.
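To make the state table concrete, here is a minimal Python sketch of such a bit-serial comparator module; it is not from the original text, and the class name and test values are purely illustrative.

    class SerialComparator:
        """Three-state comparator for bit-serial numbers, MSB first."""
        def __init__(self):
            self.state = 0              # 0 undecided, 1 in order, 2 reversed

        def step(self, a, b):
            """Consume one pair of input bits; return the output pair."""
            if self.state == 1:
                return (a, b)           # order already decided: pass through
            if self.state == 2:
                return (b, a)           # order decided the other way: swap
            if a != b:                  # first differing bit fixes the state
                self.state = 1 if a < b else 2
            return (min(a, b), max(a, b))

    # Feeding 101 (= 5) and 011 (= 3) bit by bit:
    m = SerialComparator()
    outputs = [m.step(a, b) for a, b in zip((1, 0, 1), (0, 1, 1))]
    assert outputs == [(0, 1), (1, 0), (1, 1)]   # left line emits 011, right 101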


Fig. 44. Another way to represent the network of Fig. 43, as it sorts the sequence of four numbers (4, 1, 3, 2).


In order to develop the theory of sorting networks it is convenient to repre- 
sent them in a slightly different way, illustrated in Fig. 44. Here numbers enter at 
the left , and comparator modules are represented by vertical connections between 
two lines; each comparator causes an interchange of its inputs, if necessary, so 
that the larger number sinks to the lower line after passing the comparator. At 
the right of the diagram all the numbers are in order from top to bottom. 
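Under this representation a network is completely described by its list of comparator pairs, which makes it easy to experiment with. The following Python sketch (our illustration, with 0-based line numbers) applies a network to a sequence of keys:

    def apply_network(network, keys):
        """Apply comparators (i, j) in order; smaller key goes to line i."""
        keys = list(keys)
        for i, j in network:
            if keys[i] > keys[j]:
                keys[i], keys[j] = keys[j], keys[i]
        return keys

    # The five-comparator 4-sorter of Figs. 43 and 44:
    four_sorter = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
    assert apply_network(four_sorter, [4, 1, 3, 2]) == [1, 2, 3, 4]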




Our previous studies of optimal sorting have concentrated on minimizing 
the number of comparisons, with little or no regard for any underlying data 
movement or for the complexity of the decision structure that may be necessary. 
In this respect sorting networks have obvious advantages, since the data can be 
maintained in n locations and the decision structure is “straight line” — there 
is no need to remember the results of previous comparisons, since the plan is 
immutably fixed in advance. Another important advantage of sorting networks 
is that we can usually overlap several of the operations, performing them simultaneously (on a suitable machine). For example, the five steps in Figs. 43 and 44
can be collapsed into three when simultaneous nonoverlapping comparisons are 
allowed, since the first two and the second two can be combined. We shall exploit 
this property of sorting networks later in this section. Thus sorting networks can 
be very useful, although it is not at all obvious that efficient n-element sorting 
networks can be constructed for large n; we may find that many additional 
comparisons are needed in order to keep the decision structure oblivious. 


Fig. 45. Making (n+1)-sorters from n-sorters: (a) insertion, (b) selection.


There are two simple ways to construct a sorting network for n + 1 elements when an n-element network is given, using either the principle of insertion or the principle of selection. Figure 45(a) shows how the (n+1)st element can be inserted into its proper place after the first n elements have been sorted; and part (b) of the figure shows how the largest element can be selected before we proceed to sort the remaining ones. Repeated application of Fig. 45(a) gives the network analog of straight insertion sorting (Algorithm 5.2.1S), and repeated application of Fig. 45(b) yields the network analog of the bubble sort (Algorithm 5.2.2B). Figure 46 shows the corresponding six-element networks.
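Both constructions are easy to express recursively. Here is a hedged Python sketch (the function names are ours, lines numbered from 0) that builds the comparator lists Fig. 46 draws:

    def insertion_network(n):
        """Straight-insertion analog: sort n-1 lines, then sift line n-1 up."""
        if n < 2:
            return []
        return insertion_network(n - 1) + [(i, i + 1) for i in range(n - 2, -1, -1)]

    def selection_network(n):
        """Bubble-sort analog: sink the maximum to line n-1, then recurse."""
        if n < 2:
            return []
        return [(i, i + 1) for i in range(n - 1)] + selection_network(n - 1)

    assert apply_network(insertion_network(6), [6, 5, 4, 3, 2, 1]) == [1, 2, 3, 4, 5, 6]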





Fig. 46. Network analogs of elementary internal sorting schemes, obtained by applying 
the constructions of Fig. 45 repeatedly: (a) straight insertion, (b) bubble sort. 





Fig. 47. With parallelism, straight insertion = bubble sort! 

Notice that when we collapse either network together to allow simultaneous operations, both methods actually reduce to the same "triangular" (2n−3)-stage procedure (Fig. 47).

It is easy to prove that the network of Figs. 43 and 44 will sort any set 
of four numbers into order, since the first four comparators route the smallest 
and the largest elements to the correct places, and the last comparator puts the 
remaining two elements in order. But it is not always so easy to tell whether or 
not a given network will sort all possible input sequences; for example, both 




(two 4-element network diagrams, omitted here)

are valid 4-element sorting networks, but the proofs of their validity are not trivial. It would be sufficient to test each n-element network on all n! permutations of n distinct numbers, but in fact we can get by with far fewer tests:

Theorem Z (Zero-one principle). If a network with n input lines sorts all 2^n sequences of 0s and 1s into nondecreasing order, it will sort any arbitrary sequence of n numbers into nondecreasing order.

Proof. (This is a special case of Bouricius's theorem, exercise 5.3.1-12.) If f(x) is any monotonic function, with f(x) ≤ f(y) whenever x ≤ y, and if a given network transforms (x_1, …, x_n) into (y_1, …, y_n), then it is easy to see that the network will transform (f(x_1), …, f(x_n)) into (f(y_1), …, f(y_n)). If y_i > y_{i+1} for some i, consider the monotonic function f that takes all numbers < y_i into 0 and all numbers ≥ y_i into 1; this defines a sequence (f(x_1), …, f(x_n)) of 0s and 1s that is not sorted by the network. Hence if all 0-1 sequences are sorted, we have y_i ≤ y_{i+1} for 1 ≤ i < n. ∎
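In code, Theorem Z turns validity checking into a 2^n-case test. The sketch below (reusing apply_network and four_sorter from the earlier illustrative code) is one way to run it:

    from itertools import product

    def sorts_all_binary(network, n):
        """Check the network on all 2**n vectors of 0s and 1s (Theorem Z)."""
        for bits in product((0, 1), repeat=n):
            out = apply_network(network, bits)
            if any(out[k] > out[k + 1] for k in range(n - 1)):
                return False
        return True

    assert sorts_all_binary(four_sorter, 4)    # 16 tests instead of 4! = 24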

The zero-one principle is quite helpful in the construction of sorting networks. As a nontrivial example, we can derive a generalized version of Batcher's "merge exchange" sort (Algorithm 5.2.2M). The idea is to sort m + n elements by (i) sorting the first m and the last n independently, then (ii) applying an (m,n)-merging network to the result. An (m,n)-merging network can be constructed inductively as follows:

a) If m = 0 or n = 0, the network is empty. If m = n = 1, the network is a single comparator module.

b) If mn > 1, let the sequences to be merged be (x_1, …, x_m) and (y_1, …, y_n). Merge the "odd sequences" (x_1, x_3, …, x_{2⌈m/2⌉−1}) and (y_1, y_3, …, y_{2⌈n/2⌉−1}),




Fig. 48. The odd-even merge, when m = 4 and n = 7.

obtaining the sorted result (v_1, v_2, …, v_{⌈m/2⌉+⌈n/2⌉}); also merge the "even sequences" (x_2, x_4, …, x_{2⌊m/2⌋}) and (y_2, y_4, …, y_{2⌊n/2⌋}), obtaining the sorted result (w_1, w_2, …, w_{⌊m/2⌋+⌊n/2⌋}). Finally, apply the comparison-interchange operations

w_1:v_2, w_2:v_3, w_3:v_4, …, w_{⌊m/2⌋+⌊n/2⌋}:v*   (1)

to the sequence

(v_1, w_1, v_2, w_2, v_3, w_3, …, v_{⌊m/2⌋+⌊n/2⌋}, w_{⌊m/2⌋+⌊n/2⌋}, v*, v**);   (2)

the result will be sorted(!). Here v* = v_{⌊m/2⌋+⌊n/2⌋+1} does not exist if both m and n are even, and v** = v_{⌊m/2⌋+⌊n/2⌋+2} does not exist unless both m and n are odd; the total number of comparator modules indicated in (1) is ⌊(m+n−1)/2⌋.

Batcher's (m,n)-merging network is called the odd-even merge. A (4,7)-merge constructed according to these principles is illustrated in Fig. 48.
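The recursive rule (a)-(b) translates directly into code. The following Python sketch (our own formulation, not Knuth's) merges two sorted lists by merging their odd and even subsequences and then applying the interchanges (1) to the interleaved sequence (2):

    def odd_even_merge(x, y):
        """Merge sorted lists x and y by Batcher's odd-even recursion."""
        if not x or not y:
            return list(x) + list(y)
        if len(x) == 1 and len(y) == 1:
            return [min(x[0], y[0]), max(x[0], y[0])]
        v = odd_even_merge(x[0::2], y[0::2])      # merge the odd sequences
        w = odd_even_merge(x[1::2], y[1::2])      # merge the even sequences
        z = []                                    # interleave (v1, w1, v2, w2, ...)
        for k in range(len(v)):                   # len(v) >= len(w) always
            z.append(v[k])
            if k < len(w):
                z.append(w[k])
        for i in range(1, len(z) - 1, 2):         # the interchanges (1)
            if z[i] > z[i + 1]:
                z[i], z[i + 1] = z[i + 1], z[i]
        return z

    assert odd_even_merge([1, 4, 6, 9], [2, 3, 5, 7, 8, 10, 11]) == list(range(1, 12))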

To prove that this rather strange merging procedure actually works, when mn > 1, we use the zero-one principle, testing it on all sequences of 0s and 1s. After the initial m-sort and n-sort, the sequence (x_1, …, x_m) will consist of k 0s followed by m−k 1s, and the sequence (y_1, …, y_n) will be l 0s followed by n−l 1s, for some k and l. Hence the sequence (v_1, v_2, …) will consist of exactly ⌈k/2⌉ + ⌈l/2⌉ 0s, followed by 1s; and (w_1, w_2, …) will consist of ⌊k/2⌋ + ⌊l/2⌋ 0s, followed by 1s. Now here's the point:

(⌈k/2⌉ + ⌈l/2⌉) − (⌊k/2⌋ + ⌊l/2⌋) = 0, 1, or 2.   (3)

If this difference is 0 or 1, the sequence (2) is already in order, and if the difference is 2 one of the comparison-interchanges in (1) will fix everything up. This completes the proof. (Note that the zero-one principle reduces the merging problem from a consideration of \binom{m+n}{m} cases to only (m+1)(n+1), represented by the two parameters k and l.)

Let C(m,n) be the number of comparator modules used in the odd-even merge for m and n, not counting the initial m-sort and n-sort; we have

C(m,n) = mn, if mn ≤ 1;
C(m,n) = C(⌈m/2⌉, ⌈n/2⌉) + C(⌊m/2⌋, ⌊n/2⌋) + ⌊(m+n−1)/2⌋, if mn > 1.   (4)




This is not an especially simple function of m and n, in general, but by noting that C(1,n) = n and that

C(m+1, n+1) − C(m,n) = 1 + C(⌊m/2⌋+1, ⌊n/2⌋+1) − C(⌊m/2⌋, ⌊n/2⌋), if mn ≥ 1,

we can derive the relation

C(m+1, n+1) − C(m,n) = ⌊lg m⌋ + 2 + ⌊n/2^{⌊lg m⌋+1}⌋, if n ≥ m ≥ 1.   (5)

Consequently

C(m, m+r) = B(m) + m + R_m(r), for m ≥ 0 and r ≥ 0,   (6)

where B(m) is the "binary insertion" function Σ_{k=1}^{m} ⌈lg k⌉ of Eq. 5.3.1-(3), and where R_m(r) denotes the sum of the first m terms of the series

⌊(r+0)/1⌋ + ⌊(r+1)/2⌋ + ⌊(r+2)/4⌋ + ⌊(r+3)/4⌋ + ⋯ + ⌊(r+j)/2^{⌊lg j⌋+1}⌋ + ⋯.   (7)

In particular, when r = 0 we have the important special case

C(m,m) = B(m) + m.   (8)

Furthermore if t = ⌈lg m⌉,

R_m(r + 2^t) = R_m(r) + 1·2^{t−1} + 2·2^{t−2} + ⋯ + 2^{t−1}·2^0 + m = R_m(r) + m + t·2^{t−1}.

Hence C(m, n+2^t) − C(m,n) has a simple form, and

C(m,n) = (t/2 + m/2^t) n + O(1), for m fixed, n → ∞, t = ⌈lg m⌉;   (9)

the O(1) term is an eventually periodic function of n, with period length 2^t. As n → ∞ we have C(n,n) = n lg n + O(n), by Eq. (8) and exercise 5.3.1-15.
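These formulas are easy to verify numerically. A short sketch (ours, with assumed helper names) checks the recurrence (4) against the closed form (8):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def comparator_count(m, n):                  # C(m, n) of Eq. (4)
        if m * n <= 1:
            return m * n
        return (comparator_count((m + 1) // 2, (n + 1) // 2)
                + comparator_count(m // 2, n // 2) + (m + n - 1) // 2)

    def binary_insertion(m):                     # B(m) of Eq. 5.3.1-(3)
        # (k-1).bit_length() equals ceil(lg k) for k >= 1
        return sum((k - 1).bit_length() for k in range(1, m + 1))

    assert all(comparator_count(m, m) == binary_insertion(m) + m
               for m in range(1, 200))           # Eq. (8)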

Minimum-comparison networks. Let Ŝ(n) be the minimum number of comparators needed in a sorting network for n elements; clearly Ŝ(n) ≥ S(n), where S(n) is the minimum number of comparisons needed in a not-necessarily-oblivious sorting procedure (see Section 5.3.1). We have Ŝ(4) = 5 = S(4), so the new constraint causes no loss of efficiency when n = 4; but already when n = 5 it turns out that Ŝ(5) = 9 while S(5) = 7. The problem of determining Ŝ(n) seems to be even harder than the problem of determining S(n); even the asymptotic behavior of Ŝ(n) is known only in a very weak sense.

It is interesting to trace the history of this problem, since each step was forged with some difficulty. Sorting networks were first explored by P. N. Armstrong, R. J. Nelson, and D. G. O'Connor, about 1954 [see U.S. Patent 3029413]; in the words of their patent attorney, "By the use of skill, it is possible to design economical n-line sorting switches using a reduced number of two-line sorting switches." After observing that Ŝ(n+1) ≤ Ŝ(n) + n, they gave special constructions for 4 ≤ n ≤ 8, using 5, 9, 12, 18, and 19 comparators, respectively.






Then Nelson worked together with R. C. Bose to show that Ŝ(2^n) ≤ 3^n − 2^n for all n; hence Ŝ(n) = O(n^{lg 3}) = O(n^{1.585}). Bose and Nelson published their interesting method in JACM 9 (1962), 282-296, where they conjectured that it was best possible; T. N. Hibbard [JACM 10 (1963), 142-150] found a similar but slightly simpler construction that used the same number of comparisons, thereby reinforcing the conjecture.

In 1964, R. W. Floyd and D. E. Knuth found a new way to approach the problem, leading to an asymptotic bound of the form Ŝ(n) = O(n^{1+c/√(log n)}). Working independently, K. E. Batcher discovered the general merging strategy outlined above. Using a number of comparators defined by the recursion

c(1) = 0,  c(n) = c(⌈n/2⌉) + c(⌊n/2⌋) + C(⌈n/2⌉, ⌊n/2⌋) for n ≥ 2,   (10)

he proved (see exercise 5.2.2-14) that

c(2^t) = (t^2 − t + 4)·2^{t−2} − 1;

consequently Ŝ(n) = O(n (log n)^2). Neither Floyd and Knuth nor Batcher published their constructions until some time later [Notices of the Amer. Math. Soc. 14 (1967), 283; Proc. AFIPS Spring Joint Computer Conf. 32 (1968), 307-314].

Several people have found ways to reduce the number of comparators used by Batcher's merge-exchange construction; the following table shows the best upper bounds currently known for Ŝ(n):

n =      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
c(n) =   0  1  3  5  9 12 16 19 26 31 37 41 48 53 59 63   (11)
Ŝ(n) ≤   0  1  3  5  9 12 16 19 25 29 35 39 45 51 56 60

Since Ŝ(n) < c(n) for 8 < n ≤ 16, merge exchange is nonoptimal for all n > 8. When n ≤ 8, merge exchange uses the same number of comparators as the construction of Bose and Nelson. Floyd and Knuth proved in 1964-1966 that the values listed for Ŝ(n) are exact when n ≤ 8 [see A Survey of Combinatorial Theory (North-Holland, 1973), 163-172]; the values of Ŝ(n) for n > 8 are still not known.

Constructions that lead to the values in (11) are shown in Fig. 49. The network for n = 9, based on an interesting three-way merge, was found by R. W. Floyd in 1964; its validity can be established by using the general principle described in exercise 27. The network for n = 10 was discovered by A. Waksman in 1969, by regarding the inputs as permutations of {1, 2, …, 10} and trying to reduce as much as possible the number of values that can appear on each line at a given stage, while maintaining some symmetry.

The network shown for n = 13 has quite a different pedigree: Hugues Juillé [Lecture Notes in Comp. Sci. 929 (1995), 246-260] used a computer program to construct it, by simulating an evolutionary process of genetic breeding. The network exhibits no obvious rhyme or reason, but it works, and it is shorter than any other construction devised so far by human ratiocination.

A 62-comparator sorting network for 16 elements was found by G. Shapiro 
in 1969, and this was rather surprising since Batcher’s method (63 comparisons) 



Fig. 49. Efficient sorting networks. (The network shown for n = 16 has 60 modules and delay 10.)

would appear to be at its best when n is a power of 2. Soon after hearing of 
Shapiro's construction, M. W. Green tripled the amount of surprise by finding the 60-comparator sorter in Fig. 49. The first portion of Green's construction is fairly easy to understand; after the 32 comparison/interchanges to the left of the dotted line have been made, the lines can be labeled with the 16 subsets of {a, b, c, d}, in such a way that the line labeled s is known to contain a number less
than or equal to the contents of the line labeled t whenever s is a subset of t. The 
state of the sort at this point is discussed further in exercise 32. Comparisons 
made on subsequent levels of Green’s network become increasingly mysterious, 
however, and as yet nobody has seen how to generalize the construction in order 
to obtain correspondingly efficient networks for higher values of n. 

Shapiro and Green also discovered the network shown for n = 12. When n = 11, 14, or 15, good networks can be found by removing the bottom line of the network for n + 1, together with all comparators touching that line.




The best sorting network currently known for 256 elements, due to D. Van Voorhis, shows that Ŝ(256) ≤ 3651, compared to 3839 by Batcher's method. [See R. L. Drysdale and F. H. Young, SICOMP 4 (1975), 264-270.] As n → ∞, it turns out in fact that Ŝ(n) = O(n log n); this astonishing upper bound was proved by Ajtai, Komlós, and Szemerédi in Combinatorica 3 (1983), 1-19. The networks they constructed are not of practical interest, since many comparators were introduced just to save a factor of log n; Batcher's method is much better, unless n exceeds the total memory capacity of all computers on earth! But the theorem of Ajtai, Komlós, and Szemerédi does establish the true asymptotic growth rate of Ŝ(n), up to a constant factor.

Minimum-time networks. In physical realizations of sorting networks, and 
on parallel computers, it is possible to do nonoverlapping comparison-exchanges 
at the same time; therefore it is natural to try to minimize the delay time. A 
moment’s reflection shows that the delay time of a sorting network is equal to 
the maximum number of comparators in contact with any “path” through the 
network, if we define a path to consist of any left-to-right route that possibly 
switches lines at the comparators. We can put a sequence number on each 
comparator indicating the earliest time it can be executed; this is one higher than 
the maximum of the sequence numbers of the comparators that occur earlier on 
its input lines. (See Fig. 50(a); part (b) of the figure shows the same network 
redrawn so that each comparison is done at the earliest possible moment.) 
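The sequence-numbering rule is itself a tiny algorithm. Here is a Python sketch (ours, reusing four_sorter from the earlier example):

    def delay_time(network, n):
        """Delay = longest chain of comparators along any path of lines."""
        stage_of_line = [0] * n
        for i, j in network:
            s = max(stage_of_line[i], stage_of_line[j]) + 1   # one higher
            stage_of_line[i] = stage_of_line[j] = s
        return max(stage_of_line)

    assert delay_time(four_sorter, 4) == 3    # five comparators, three stages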



Fig. 50. Doing each comparison at the earliest possible time. 


Batcher's odd-even merging network described above takes T_B(m,n) units of time, where T_B(m,0) = T_B(0,n) = 0, T_B(1,1) = 1, and

T_B(m,n) = 1 + max(T_B(⌊m/2⌋, ⌊n/2⌋), T_B(⌈m/2⌉, ⌈n/2⌉)) for mn ≥ 2.

We can use these relations to prove that T_B(m, n+1) ≥ T_B(m,n), by induction; hence T_B(m,n) = 1 + T_B(⌈m/2⌉, ⌈n/2⌉) for mn ≥ 2, and it follows that

T_B(m,n) = 1 + ⌈lg max(m,n)⌉, for mn ≥ 1.   (12)

Exercise 5 shows that Batcher's sorting method therefore has a delay time of

⌈lg n⌉(⌈lg n⌉ + 1)/2.   (13)

Let T̂(n) be the minimum achievable delay time in any sorting network for n elements. It is possible to improve some of the networks described above so


Fig. 51. Sorting networks that are the fastest known, when comparisons are performed in parallel: n = 10 (31 modules, delay 7); n = 11 (35 modules, delay 8); n = 16 (61 modules, delay 9).

that they have smaller delay time but use no more comparators, as shown for n = 6, n = 9, and n = 11 in Fig. 51, and for n = 10 in exercise 7. Still smaller delay time can be achieved if we add one or two extra comparator modules, as shown in the remarkable networks for n = 10, 12, and 16 in Fig. 51. These constructions yield the following upper bounds on T̂(n) for small n:

n =      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T̂(n) ≤  0  1  3  3  5  5  6  6  7  7  8  8  9  9  9  9   (14)

For n ≤ 10 the values given here are known to be exact (see exercise 4). The networks in Fig. 51 merit careful study, because it is by no means obvious that they always sort. Some of these networks were discovered in 1969-1971 by G. Shapiro (n = 6, 12) and D. Van Voorhis (n = 10, 16); the others were found in 2001 by Loren Schwiebert, using genetic methods (n = 9, 11).




Merging networks. Let M̂(m,n) denote the minimum number of comparator modules needed in a network that merges m elements x_1 ≤ ⋯ ≤ x_m with n elements y_1 ≤ ⋯ ≤ y_n to form the sorted sequence z_1 ≤ ⋯ ≤ z_{m+n}. At present no merging networks have been discovered that are superior to the odd-even merge described above; hence the function C(m,n) in (6) represents the best upper bound known for M̂(m,n).

R. W. Floyd has discovered an interesting way to find lower bounds for this merging problem.

Theorem F. For all n ≥ 1, we have M̂(2n, 2n) ≥ 2M̂(n,n) + n.

Proof. Consider a network with M̂(2n, 2n) comparator modules, capable of sorting all input sequences (z_1, …, z_{4n}) such that z_1 ≤ z_3 ≤ ⋯ ≤ z_{4n−1} and z_2 ≤ z_4 ≤ ⋯ ≤ z_{4n}. We may assume that each module replaces (z_i, z_j) by (min(z_i, z_j), max(z_i, z_j)), for some i < j (see exercise 16). The comparators can therefore be divided into three classes:

a) i ≤ 2n and j ≤ 2n.

b) i > 2n and j > 2n.

c) i ≤ 2n and j > 2n.

Class (a) must contain at least M̂(n,n) comparators, since z_{2n+1}, z_{2n+2}, …, z_{4n} may be already in their final position when the merge starts; similarly, there are at least M̂(n,n) comparators in class (b). Furthermore the input sequence (0, 1, 0, 1, …, 0, 1) shows that class (c) contains at least n comparators, since n zeros must move from {z_{2n+1}, …, z_{4n}} to {z_1, …, z_{2n}}. ∎

Repeated use of Theorem F proves that M̂(2^m, 2^m) ≥ ½(m+2)·2^m; hence M̂(n,n) ≥ ½ n lg n + O(n). We know from Theorem 5.3.2M that merging without the network restriction requires only M(n,n) = 2n − 1 comparisons; hence we have proved that merging with networks is intrinsically harder than merging in general.

The odd-even merge shows that

M̂(m,n) ≤ C(m,n) = ½(m+n) lg min(m,n) + O(m+n).

P. B. Miltersen, M. Paterson, and J. Tarui [JACM 43 (1996), 147-165] have improved Theorem F by establishing the lower bound

M̂(m,n) ≥ ½((m+n) lg(m+1) − m/ln 2) for 1 ≤ m ≤ n.

Consequently M̂(m,n) = ½(m+n) lg min(m,n) + O(m+n).

The exact formula M̂(2,n) = C(2,n) = ⌈3n/2⌉ has been proved by A. C. Yao and F. F. Yao [JACM 23 (1976), 566-571]. The value of M̂(m,n) is also known to equal C(m,n) for m = n ≤ 5; see exercise 9.

Bitonic sorting. When simultaneous comparisons are allowed, we have seen in Eq. (12) that the odd-even merge uses ⌈lg(2n)⌉ units of delay time, when 1 ≤ m ≤ n. Batcher has devised another type of network for merging, called a







Fig. 52. Batcher’s bitonic sorter of order 7. 


bitonic sorter, which lowers the delay time to ⌈lg(m+n)⌉ although it requires more comparator modules. [See U.S. Patent 3428946 (1969).]

Let us say that a sequence (z_1, …, z_p) of p numbers is bitonic if z_1 ≥ ⋯ ≥ z_k ≤ ⋯ ≤ z_p for some k, 1 ≤ k ≤ p. (Compare this with the ordinary definition of "monotonic" sequences.) A bitonic sorter of order p is a comparator network that is capable of sorting any bitonic sequence of length p into nondecreasing order. The problem of merging x_1 ≤ ⋯ ≤ x_m with y_1 ≤ ⋯ ≤ y_n is a special case of the bitonic sorting problem, since merging can be done by applying a bitonic sorter of order m+n to the sequence (x_m, …, x_1, y_1, …, y_n).

Notice that when a sequence (z_1, …, z_p) is bitonic, so are all of its subsequences. Shortly after Batcher discovered the odd-even merging networks, he observed that we can construct a bitonic sorter of order p in an analogous way, by first sorting the bitonic subsequences (z_1, z_3, z_5, …) and (z_2, z_4, z_6, …) independently, then comparing and interchanging z_1:z_2, z_3:z_4, …. (See exercise 10 for a proof.) If C'(p) is the corresponding number of comparator modules, we have

C'(p) = C'(⌈p/2⌉) + C'(⌊p/2⌋) + ⌊p/2⌋, for p ≥ 2;   (15)

and the delay time is clearly ⌈lg p⌉. Figure 52 shows the bitonic sorter of order 7 constructed in this way; it can be used as a (3,4)- as well as a (2,5)-merging network, with three units of delay. The odd-even merge for m = 2 and n = 5 saves one comparator but adds one more level of delay.

Batcher's bitonic sorter of order 2^t is particularly interesting; it consists of t levels of 2^{t−1} comparators each. If we number the input lines z_0, z_1, …, z_{2^t−1}, element z_i is compared to z_j on level l if and only if i and j differ only in the lth most significant bit of their binary representations. This simple structure leads to parallel sorting networks that are as fast as merge exchange, Algorithm 5.2.2M, but considerably easier to implement. (See exercises 11 and 13.)
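The level structure of the order-2^t bitonic sorter can be generated in a few lines; this Python sketch (ours, with apply_network from the earlier example) shows the idea:

    def bitonic_sorter(t):
        """Level l compares z_i with z_j when i, j differ in bit t-l only."""
        comps = []
        for l in range(1, t + 1):
            bit = 1 << (t - l)                   # the l-th most significant bit
            comps += [(i, i | bit) for i in range(1 << t) if not i & bit]
        return comps

    net = bitonic_sorter(3)                      # 3 levels of 4 comparators
    assert len(net) == 3 * 4
    # A bitonic input (decreasing, then increasing) comes out sorted:
    assert apply_network(net, [9, 8, 4, 2, 3, 5, 6, 7]) == [2, 3, 4, 5, 6, 7, 8, 9]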

Bitonic merging is optimum, in the sense that no parallel merging method based on simultaneous disjoint comparisons can sort in fewer than ⌈lg(m+n)⌉ stages, whether it works obliviously or not. (See exercise 46.) Another way to achieve this optimum time, with fewer comparisons but a slightly more complicated control logic, is discussed in exercise 57.

When 1 ≤ m ≤ n, the nth smallest output of an (m,n)-merging network depends on 2m + [m<n] of the inputs (see exercise 29). If it can be computed by comparators with l levels of delay, it involves at most 2^l of the inputs; hence 2^l ≥ 2m + [m<n], and l ≥ ⌈lg(2m + [m<n])⌉. Batcher has shown [Report GER-14122 (Akron, Ohio: Goodyear Aerospace Corporation, 1968)] that this






Fig. 53. Merging one item with six others, with multiple fanout, in order to achieve 
the minimum possible delay time. 


minimum delay time is achievable if we allow "multiple fanout" in the network, namely the splitting of lines so that the same number is fed to many modules at once. For example, one of his networks, capable of merging one item with n others after only two levels of delay, is illustrated for n = 6 in Fig. 53. Of course, networks with multiple fanout do not conform to our conventions, and it is fairly easy to see that any (1,n)-merging network without multiple fanout must have a delay time of lg(n+1) or more. (See exercise 45.)

Selection networks. We can also use networks to approach the problem of Section 5.3.3. Let Û_t(n) denote the minimum number of comparators required in a network that moves the t largest of n distinct inputs into t specified output lines; the numbers are allowed to appear in any order on these output lines. Let V̂_t(n) denote the minimum number of comparators required to move the tth largest of n distinct inputs into a specified output line; and let Ŵ_t(n) denote the minimum number of comparators required to move the t largest of n distinct inputs into t specified output lines in nondecreasing order. It is not difficult to deduce (see exercise 17) that

Û_t(n) ≤ V̂_t(n) ≤ Ŵ_t(n).   (16)

Suppose first that we have 2t elements (x_1, …, x_{2t}) and we wish to select the largest t. V. E. Alekseev [Kibernetika 5, 5 (1969), 99-103] has observed that we can do the job by first sorting (x_1, …, x_t) and (x_{t+1}, …, x_{2t}), then comparing and interchanging

x_1:x_{2t}, x_2:x_{2t−1}, …, x_t:x_{t+1}.   (17)

Since none of these pairs can contain more than one of the largest t elements (why?), Alekseev's procedure must select the largest t elements.

If we want to select the t largest of nt elements, we can apply Alekseev's procedure n − 1 times, eliminating t elements each time; hence

Û_t(nt) ≤ (n−1)(2Ŝ(t) + t).   (18)
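Alekseev's idea is easy to simulate. The sketch below (hypothetical function names, ours) performs one round of (17), sorting each half first, and then iterates it to realize the bound (18):

    def alekseev_round(block):
        """From 2t elements, isolate the largest t: sort halves, apply (17)."""
        t = len(block) // 2
        lo, hi = sorted(block[:t]), sorted(block[t:])
        for i in range(t):                       # compare x_{i+1} : x_{2t-i}
            if lo[i] > hi[t - 1 - i]:
                lo[i], hi[t - 1 - i] = hi[t - 1 - i], lo[i]
        return hi                                # the largest t, in some order

    def largest_t(values, t):
        """Select the t largest of n*t values with n-1 eliminating rounds."""
        survivors = list(values[:t])
        for k in range(t, len(values), t):
            survivors = alekseev_round(survivors + list(values[k:k + t]))
        return survivors

    assert sorted(largest_t([3, 1, 4, 1, 5, 9, 2, 6], 2)) == [6, 9]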




Fig. 54. Separating the largest four from the smallest four. (Numbers on these lines are used in the proof of Theorem A.)

Alekseev also derived an interesting lower bound for the selection problem:

Theorem A. Û_t(n) ≥ (n−t)⌈lg(t+1)⌉.

Proof. It is most convenient to consider the equivalent problem of selecting the smallest t elements. We can attach numbers (l,u) to each line of a comparator network, as shown in Fig. 54, where l and u denote respectively the minimum and maximum values that can appear at that position when the input is a permutation of {1, 2, …, n}. Let l_i and l_j be the lower bounds on lines i and j before a comparison of x_i:x_j, and let l'_i and l'_j be the corresponding lower bounds after the comparison. It is obvious that l'_i = min(l_i, l_j); exercise 24 proves the (nonobvious) relation

l'_j ≤ l_i + l_j.   (19)


Fig. 55. Another interpretation for the network of Fig. 54.

Now let us reinterpret the network operations in another way (see Fig. 55): All input lines are assumed to contain zero, and each "comparator" now places the smaller of its inputs on the upper line and the larger plus one on the lower line. The resulting numbers (m_1, m_2, …, m_n) have the property that l_k ≤ 2^{m_k} throughout the network, since this holds initially and it is preserved by each comparator because of (19).



Table 1
COMPARISONS NEEDED IN SELECTION NETWORKS (Û_t(n), V̂_t(n), Ŵ_t(n))

         t = 1     t = 2     t = 3        t = 4        t = 5       t = 6
n = 1   (0,0,0)
n = 2   (1,1,1)   (0,1,1)
n = 3   (2,2,2)   (2,3,3)   (0,2,3)
n = 4   (3,3,3)   (4,5,5)   (3,5,5)      (0,3,5)
n = 5   (4,4,4)   (6,7,7)   (6,7,8)      (4,7,9)      (0,4,9)
n = 6   (5,5,5)   (8,9,9)   (8,10,10)    (8,10,12)    (5,9,12)   (0,5,12)


Furthermore, the final value of m_1 + m_2 + ⋯ + m_n is the total number of comparators in the network, since each comparator adds unity to this sum.

If the network selects the smallest t numbers, then n − t of the l_i are ≥ t + 1; hence n − t of the m_i must be ≥ ⌈lg(t+1)⌉. ∎

The lower bound in Theorem A turns out to be exact when t = 1 and when t = 2 (see exercise 19). Table 1 gives some values of Û_t(n), V̂_t(n), and Ŵ_t(n) for small t and n. Andrew Yao [Ph.D. thesis, U. of Illinois (1975)] determined the asymptotic behavior of Û_t(n) for fixed t, by showing that Û_3(n) = 2n + lg n + O(1) and Û_t(n) = n⌈lg(t+1)⌉ + O((log n)^{⌊lg t⌋}) as n → ∞; the minimum delay time is lg n + ⌊lg t⌋ lg lg n + O(log log log n). N. Pippenger [SICOMP 20 (1991), 878-887] has proved by nonconstructive methods that for any ε > 0 there exist selection networks with Û_{⌈n/2⌉}(n) ≤ (2+ε) n lg n, whenever n is sufficiently large (depending on ε).

EXERCISES — First Set

Several of the following exercises develop the theory of sorting networks in detail, and it is convenient to introduce some notation. We let [i:j] stand for a comparison/interchange module. A network with n inputs and r comparator modules is written [i_1:j_1][i_2:j_2]…[i_r:j_r], where each of the i's and j's is ≤ n; we shall call it an n-network for short. A network is called standard if i_q < j_q for 1 ≤ q ≤ r. Thus, for example, Fig. 44 depicts a standard 4-network, denoted by the comparator sequence [1:2][3:4][1:3][2:4][2:3].

The text's convention for drawing network diagrams represents only standard networks; all comparators [i:j] are represented by a line from i to j, where i < j. When nonstandard networks must be drawn, we can use an arrow from i to j, indicating that the larger number goes to the point of the arrow. For example, Fig. 56 illustrates a nonstandard network for 16 elements, whose comparators are [1:2][4:3][5:6][8:7]…; exercise 11 proves that Fig. 56 is a sorting network.

Fig. 56. A nonstandard sorting network based on bitonic sorting, in four stages.

If x = (x_1, …, x_n) is an n-vector and α is an n-network, we write xα for the vector of numbers ((xα)_1, …, (xα)_n) produced by the network. For brevity, we also let a∨b = max(a,b), a∧b = min(a,b), ā = 1 − a. Thus (x[i:j])_i = x_i∧x_j, (x[i:j])_j = x_i∨x_j, and (x[i:j])_k = x_k when i ≠ k ≠ j. We say α is a sorting network if (xα)_i ≤ (xα)_{i+1} for all x and for 1 ≤ i < n.

The symbol e^{(i)} stands for a vector that has 1 in position i, 0 elsewhere; thus (e^{(i)})_j = δ_{ij}. The symbol D_n stands for the set of all 2^n n-place vectors of 0s and 1s, and P_n stands for the set of all n! vectors that are permutations of {1, 2, …, n}. We write x∧y and x∨y for the vectors (x_1∧y_1, …, x_n∧y_n) and (x_1∨y_1, …, x_n∨y_n), and we write x ⊆ y if x_i ≤ y_i for all i. Thus x ⊆ y if and only if x∨y = y if and only if x∧y = x. If x and y are in D_n, we say that x covers y if x = (y ∨ e^{(i)}) ≠ y for some i. Finally, for all x in D_n we let ν(x) be the number of 1s in x, and ζ(x) the number of 0s; thus ν(x) + ζ(x) = n.

1. [20] Draw a network diagram for the odd-even merge when m = 3 and n = 5. 

2. [22] Show that V. Pratt's sorting algorithm (exercise 5.2.1-30) leads to a sorting network for n elements that has approximately (log_2 n)(log_3 n) levels of delay. Draw the corresponding network for n = 12.

3. [M20] (K. E. Batcher.) Find a simple relation between C(m,m—1) and C(m,m). 

► 4. [M23] Prove that T̂(6) = 5.

5. [M16] Prove that (13) is the delay time associated with the sorting network outlined in (10).

6. [28] Let T(n) be the minimum number of stages needed to sort n distinct numbers by making simultaneous disjoint comparisons (without necessarily obeying the network constraint); such comparisons can be represented as a node containing a set of pairs {i_1:j_1, i_2:j_2, …, i_r:j_r}, where i_1, j_1, i_2, j_2, …, i_r, j_r are distinct, with 2^r branches below this node for the respective cases

(K_{i_1} < K_{j_1}, K_{i_2} < K_{j_2}, …, K_{i_r} < K_{j_r}),
(K_{i_1} > K_{j_1}, K_{i_2} < K_{j_2}, …, K_{i_r} < K_{j_r}),

etc. Prove that T(5) = T(6) = 5.





7. [25] Show that if the final three comparators of the network for n = 10 in Fig. 49 are replaced by the "weaker" sequence [5:6][4:5][6:7], the network will still sort.

8. [M20] Prove that M̂(m_1+m_2, n_1+n_2) ≥ M̂(m_1,n_1) + M̂(m_2,n_2) + min(m_1,n_2), for m_1, m_2, n_1, n_2 ≥ 0.

9. [M25] (R. W. Floyd.) Prove that M̂(3,3) = 6, M̂(4,4) = 9, M̂(5,5) = 13.

10. [M22] Prove that Batcher's bitonic sorter, as defined in the remarks preceding (15), is valid. [Hint: It is only necessary to prove that all sequences consisting of k 1s followed by l 0s followed by n−k−l 1s will be sorted.]

11. [M23] Prove that Batcher's bitonic sorter of order 2^t will not only sort sequences (z_0, z_1, …, z_{2^t−1}) for which z_0 ≥ ⋯ ≥ z_k ≤ ⋯ ≤ z_{2^t−1}; it will also sort any sequence for which z_0 ≤ ⋯ ≤ z_k ≥ ⋯ ≥ z_{2^t−1}. [As a consequence, the network in Fig. 56 will sort 16 elements, since each stage consists of bitonic sorters or reverse-order bitonic sorters, applied to sequences that have been sorted in opposite directions.]

12. [M20] Prove or disprove: If x and y are bitonic sequences of the same length, so are x∨y and x∧y.

► 13. [24] (H. S. Stone.) Show that a sorting network for 2^t elements can be constructed by following the pattern illustrated for t = 4 in Fig. 57. Each of the t^2 steps in this scheme consists of a "perfect shuffle" of the first 2^{t−1} elements with the last 2^{t−1}, followed by simultaneous operations performed on 2^{t−1} pairs of adjacent elements. Each of the latter operations is either "0" (no operation), "+" (a standard comparator module), or "−" (a reverse comparator module). The sorting proceeds in t stages of t steps each; during the last stage all operations are "+". During stage s, for s < t, we do t − s steps in which all operations are "0", followed by s steps in which the operations within step q consist alternately of 2^{q−1} "+" followed by 2^{q−1} "−", for q = 1, 2, …, s.

[Note that this sorting scheme could be performed by a fairly simple device whose circuitry performs one "shuffle-and-operate" step and feeds the output lines back into the input. The first three steps in Fig. 57 could of course be eliminated; they have been retained only to make the pattern clear. Stone notes that the same pattern "shuffle/operate" occurs in several other algorithms, such as the fast Fourier transform (see 4.6.4-(40)).]

► 14. [M27] (V. E. Alekseev.) Let α = [i_1:j_1]…[i_r:j_r] be an n-network; for 1 ≤ s ≤ r we define α^s = [i'_1:j'_1]…[i'_{s−1}:j'_{s−1}][i_s:j_s]…[i_r:j_r], where the i'_k and j'_k are obtained from i_k and j_k by changing i_s to j_s and changing j_s to i_s wherever they appear. For example, if α = [1:2][3:4][1:3][2:4][2:3], then α^4 = [1:4][3:2][1:3][2:4][2:3].

a) Prove that D_n α = D_n (α^s).

b) Prove that (α^s)^s = α.

c) A conjugate of α is any network of the form (…((α^{s_1})^{s_2})…)^{s_k}. Prove that α has at most 2^{r−1} conjugates.

d) Let g_α(x) = [x ∈ D_n α], and let f_α(x) = (x_{i_1} ∨ x_{j_1}) ∧ ⋯ ∧ (x_{i_r} ∨ x_{j_r}). Prove that g_α(x) = ⋁{f_{α'}(x) | α' is a conjugate of α}.

e) Let G_α be the directed graph with vertices {1, …, n} and with arcs i_s → j_s for 1 ≤ s ≤ r. Prove that α is a sorting network if and only if G_{α'} has an oriented path from i to i + 1 for 1 ≤ i < n and for all α' conjugate to α. [This condition is somewhat remarkable, since G_α does not depend on the order of the comparators in α.]

15. [20] Find a nonstandard sorting network for four elements that has only five 
comparator modules. 




16. [M22] Prove that the following algorithm transforms any sorting network [i_1:j_1]…[i_r:j_r] into a standard sorting network of the same length:

T1. Let q be the smallest index such that i_q > j_q. If no such index exists, stop.

T2. Change all occurrences of i_q to j_q, and all occurrences of j_q to i_q, in all comparators [i_s:j_s] for q ≤ s ≤ r. Return to T1. ∎

Thus, [4:1][3:2][1:3][2:4][1:2][3:4] is first transformed into [1:4][3:2][4:3][2:1][4:2][3:1], then [1:4][2:3][4:2][3:1][4:3][2:1], then [1:4][2:3][2:4][3:1][2:3][4:1], etc., until the standard network [1:4][2:3][2:4][1:3][1:2][3:4] is obtained.
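For readers who want to experiment, here is a Python sketch of steps T1 and T2 (our code, using 1-based line labels as in the exercise):

    def standardize(network):
        """Exercise 16: turn any sorting network into a standard one."""
        net = [list(c) for c in network]
        while True:
            q = next((q for q, (i, j) in enumerate(net) if i > j), None)   # T1
            if q is None:
                return [tuple(c) for c in net]
            a, b = net[q]
            for s in range(q, len(net)):                                   # T2
                net[s] = [b if v == a else a if v == b else v for v in net[s]]

    example = [(4, 1), (3, 2), (1, 3), (2, 4), (1, 2), (3, 4)]
    assert standardize(example) == [(1, 4), (2, 3), (2, 4), (1, 3), (1, 2), (3, 4)]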

17. [M25] Let D_{tn} be the set of all \binom{n}{t} sequences (x_1, …, x_n) of 0s and 1s having exactly t 1s. Show that Û_t(n) is the minimum number of comparators needed in a network that sorts all the elements of D_{tn}; V̂_t(n) is the minimum number needed to sort D_{tn} ∪ D_{(t−1)n}; and Ŵ_t(n) is the minimum number needed to sort ⋃_{0≤k≤t} D_{kn}.

► 18. [M20] Prove that a network that finds the median of 2t − 1 elements requires at least (t−1)⌈lg(t+1)⌉ + ⌈lg t⌉ comparator modules. [Hint: See the proof of Theorem A.]

19. [M22] Prove that Û_2(n) = 2n − 4 and V̂_2(n) = 2n − 3, for all n ≥ 2.

20. [28] Prove that (a) V̂_3(5) = 7; (b) Û_4(n) ≤ 3n − 10 for n ≥ 6.

21. [21] True or false: Inserting a new standard comparator into any standard sorting 
network yields another standard sorting network. 

22. [M17] Let α be any n-network, and let x and y be n-vectors.

a) Prove that x ⊆ y implies that xα ⊆ yα.

b) Prove that x·y ≤ (xα)·(yα), where x·y denotes the dot product x_1y_1 + ⋯ + x_ny_n.

23. [M18] Let α be an n-network. Prove that there is a permutation p ∈ P_n such that (pα)_i = j if and only if there are vectors x and y in D_n such that x covers y, (xα)_i = 1, (yα)_i = 0, and ζ(y) = j.

► 24. [M21] (V. E. Alekseev.) Let α be an n-network, and for 1 ≤ k ≤ n let

l_k = min{(pα)_k | p ∈ P_n},   u_k = max{(pα)_k | p ∈ P_n}

denote the lower and upper bounds on the range of values that may appear in line k of the output. Let l'_k and u'_k be defined similarly for the network α' = α[i:j]. Prove that

l'_i = l_i ∧ l_j,   l'_j ≤ l_i + l_j,   u'_i ≥ u_i + u_j − (n+1),   u'_j = u_i ∨ u_j.

[Hint: Given vectors x and y in D_n with (xα)_i = (yα)_j = 0, ζ(x) = l_i, and ζ(y) = l_j, find a vector z in D_n with (zα')_j = 0, ζ(z) ≤ l_i + l_j.]

25. [M30] Let l_k and u_k be as defined in exercise 24. Prove that all integers between l_k and u_k inclusive are in the set {(pα)_k | p ∈ P_n}.

26. [M24] (R. W. Floyd.) Let α be an n-network. Prove that one can determine the set D_nα = {xα | x ∈ D_n} from the set P_nα = {pα | p ∈ P_n}; conversely, P_nα can be determined from D_nα.

► 27. [M20] Let x and y be vectors, and let xα and yα be sorted. Prove that (xα)_i ≤ (yα)_j if and only if, for every choice of j elements from y, we can choose i elements from x such that every chosen x element is ≤ some chosen y element. Use this principle to prove that if we sort the rows of any matrix, then sort the columns, the rows will remain in order.




► 28. [M20] The following diagram illustrates the fact that we can systematically write down formulas for the contents of all lines in a sorting network in terms of the inputs:

a    a∧b    (a∧b)∧(c∧d)    (a∧b)∧(c∧d)
b    a∨b    (a∨b)∧(c∨d)    ((a∨b)∧(c∨d))∧((a∧b)∨(c∧d))
c    c∧d    (a∧b)∨(c∧d)    ((a∨b)∧(c∨d))∨((a∧b)∨(c∧d))
d    c∨d    (a∨b)∨(c∨d)    (a∨b)∨(c∨d)

Using the commutative laws x∧y = y∧x, x∨y = y∨x, the associative laws x∧(y∧z) = (x∧y)∧z, x∨(y∨z) = (x∨y)∨z, the distributive laws x∧(y∨z) = (x∧y)∨(x∧z), x∨(y∧z) = (x∨y)∧(x∨z), the absorption laws x∧(x∨y) = x∨(x∧y) = x, and the idempotent laws x∧x = x∨x = x, we can reduce the formulas at the right of this network to a∧b∧c∧d, (a∧b∧c)∨(a∧b∧d)∨(a∧c∧d)∨(b∧c∧d), (a∧b)∨(a∧c)∨(a∧d)∨(b∧c)∨(b∧d)∨(c∧d), and a∨b∨c∨d, respectively.

Prove that, in general, the tth largest element of {x_1, …, x_n} is given by the "elementary symmetric function"

σ_t(x_1, …, x_n) = ⋁{x_{i_1} ∧ x_{i_2} ∧ ⋯ ∧ x_{i_t} | 1 ≤ i_1 < i_2 < ⋯ < i_t ≤ n}.

[There are \binom{n}{t} terms being ∨'d together. Thus the problem of finding minimum-cost sorting networks is equivalent to the problem of computing the elementary symmetric functions with a minimum of "and/or" circuits, where at every stage we are required to replace two quantities φ and ψ by φ∧ψ and φ∨ψ.]

29. [M20] Given that x_1 ≤ x_2 ≤ x_3 and y_1 ≤ y_2 ≤ y_3 ≤ y_4 ≤ y_5, and that z_1 ≤ z_2 ≤ ⋯ ≤ z_8 is the result of merging the x's with the y's, find formulas for each of the z's in terms of the x's and the y's, using the operators ∧ and ∨.

30. [HM24] Prove that any formula involving ∧ and ∨ and the independent variables {x_1, …, x_n} can be reduced, using the identities in exercise 28, to a "canonical" form τ_1 ∨ τ_2 ∨ ⋯ ∨ τ_k, where k ≥ 1, each τ_i has the form ⋀{x_j | j ∈ S_i} where S_i is a subset of {1, 2, …, n}, and no set S_i is included in S_j for i ≠ j. Prove also that two such canonical forms are equal for all x_1, …, x_n if and only if they are identical (up to order).

31. [M24] (R. Dedekind, 1897.) Let δ_n be the number of distinct canonical forms on x_1, …, x_n in the sense of exercise 30. Thus δ_1 = 1, δ_2 = 4, and δ_3 = 18. What is δ_4?

32. [M28] (M. W. Green.) Let G_1 = {00, 01, 11}, and let G_{t+1} be the set of all strings θφψω such that θ, φ, ψ, ω have length 2^{t−1} and θφ, ψω, θψ, and φω are in G_t. Let α be the network consisting of the first four levels of the 16-sorter shown in Fig. 49. Show that D_{16}α = G_4, and prove that it has exactly δ_4 + 2 elements. (See exercise 31.)

► 33. [M22] Not all δ_n of the functions of (x_1, …, x_n) in exercise 31 can appear in comparator networks. In fact, prove that the function (x_1∧x_2) ∨ (x_2∧x_3) ∨ (x_3∧x_4) cannot appear as an output of any comparator network on (x_1, …, x_n).

34. [23] Is the following a sorting network? (diagram omitted)




35. [20] Prove that any standard sorting network must contain each of the adjacent comparators [i:i+1], for 1 ≤ i < n, at least once.

► 36. [22] The network of Fig. 47 involves only adjacent comparisons [i:i+1]; let us call such a network primitive.

a) Prove that a primitive sorting network for n elements must have at least \binom{n}{2} comparators. [Hint: Consider the inversions of a permutation.]

b) (R. W. Floyd, 1964.) Let α be a primitive network for n elements, and let x be a vector such that (xα)_i > (xα)_j for some i < j. Prove that (yα)_i > (yα)_j, where y is the vector (n, n−1, …, 1).

c) As a consequence of (b), a primitive network is a sorting network if and only if it sorts the single vector (n, n−1, …, 1).

37. [M22] The odd-even transposition sort for n numbers, n ≥ 3, is a network n levels deep with ½n(n−1) comparators, arranged in a brick-like pattern as shown in Fig. 58. (When n is even, there are two possibilities.) Such a sort is especially easy to implement in hardware, since only two kinds of actions are performed alternately. Prove that such a network is, in fact, a valid sorting network. [Hint: See exercise 36.]




Fig. 58. The odd-even transposition sort (illustrated for n = 5 and for the two possibilities when n = 6).
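As a sketch of exercise 37's brick pattern (our code, 0-based lines, with apply_network from the first-set examples; by exercise 36(c) it suffices to test the single reversed vector):

    def odd_even_transposition(n):
        """n alternating levels of adjacent comparators, as in Fig. 58."""
        comps = []
        for level in range(n):
            comps += [(i, i + 1) for i in range((level % 2), n - 1, 2)]
        return comps

    n = 6
    net = odd_even_transposition(n)
    assert len(net) == n * (n - 1) // 2
    assert apply_network(net, list(range(n, 0, -1))) == list(range(1, n + 1))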

► 38. [43] Let N = \binom{n}{2}. Find a one-to-one correspondence between Young tableaux of shape (n−1, n−2, …, 1) and primitive sorting networks [i_1:i_1+1]…[i_N:i_N+1]. [Consequently, by Theorem 5.1.4H there are exactly

N! / (1^{n−1} 3^{n−2} 5^{n−3} ⋯ (2n−3)^1)

such sorting networks.] Hint: Exercise 36(c) shows that primitive networks without redundant comparators correspond to paths from 1 2 … n to n … 2 1 in polyhedra like Fig. 1 in Section 5.1.1.

39. [25] Suppose that a primitive comparator network on n lines is known to sort the single input 1010…10 correctly. (See exercise 36; assume that n is even.) Show that its "middle third," consisting of all comparators that involve only lines ⌈n/3⌉ through ⌈2n/3⌉ inclusive, will sort all inputs.

40. [HM44] Comparators [i_1:i_1+1][i_2:i_2+1]…[i_r:i_r+1] are chosen at random, with each value of i_k ∈ {1, 2, …, n−1} equally likely; the process stops when the network contains a bubble sort configuration like that of Fig. 47 as a subnetwork. Prove that r ≤ 4n^2 + O(n^{3/2} log n), except with probability O(n^{−1000}).

41. [M47] Comparators [i_1:j_1][i_2:j_2]…[i_r:j_r] are chosen at random, with each irredundant choice 1 ≤ i_k < j_k ≤ n equally likely; the process stops when a sorting network has been obtained. Estimate the expected value of r; is it O(n^{1+ε}) for all ε > 0?

► 42. [25] (D. Van Voorhis.) Prove that Ŝ(n) ≥ Ŝ(n−1) + ⌈lg n⌉.






43. [48] Find an (m, n)-merging network with fewer than C(m,n) comparators, or 
prove that no such network exists. 

44. [50] Find the exact value of Ŝ(n) for some n > 8.

45. [M20] Prove that any (1,n)-merging network without multiple fanout must have at least ⌈lg(n+1)⌉ levels of delay.

► 46. [30] (M. Aigner.) Show that the minimum number of stages needed to merge m elements with n, using any algorithm that does simultaneous disjoint comparisons as in exercise 6, is at least ⌈lg(m+n)⌉; hence the bitonic merging network has optimum delay.

47. [47] Is the function T(n) of exercise 6 strictly less than T̂(n) for some n?

► 48. [26] We can interpret sorting networks in another way, letting each line carry a multiset of m numbers instead of a single number; under this interpretation, the operation [i:j] replaces x_i and x_j, respectively, by x_i ∧ x_j and x_i ∨ x_j, the least m and the greatest m of the 2m numbers x_i ⊎ x_j. (For example, a 4-line diagram with m = 2, built on the network of Fig. 44, takes the inputs {3,5}, {1,8}, {2,9}, {2,7} into the outputs {1,2}, {2,3}, {5,7}, {8,9}; each comparator merges its inputs and separates the lower half from the upper half.)

If a and b are multisets of m numbers each, we say that a ≪ b if and only if a ∧ b = a (equivalently, a ∨ b = b; the largest element of a is less than or equal to the smallest element of b). Thus a ∧ b ≪ a ∨ b.

Let α be an n-network, and let x = (x_1, …, x_n) be a vector in which each x_i is a multiset of m elements. Prove that if (xα)_i is not ≪ (xα)_j in the interpretation above, there is a vector y in D_n such that (yα)_i = 1 and (yα)_j = 0. [Consequently, a sorting network for n elements becomes a sorting network for mn elements if we replace each comparator by a merging network with M̂(m,m) modules. Figure 59 shows an 8-element sorter constructed from a 4-element sorter by using this observation.]


Fig. 59. An 8-sorter constructed from a 4-sorter, by using the merging interpretation. 


49. [M23] Show that, in the notation of exercise 48, (x ∧ y) ∧ z = x ∧ (y ∧ z) and (x ∨ y) ∨ z = x ∨ (y ∨ z); however, (x ∨ y) ∧ z is not always equal to (x ∧ z) ∨ (y ∧ z), and (x ∧ y) ∨ (x ∧ z) ∨ (y ∧ z) does not always equal the middle m elements of x ⊎ y ⊎ z. Find a correct formula, in terms of x, y, z and the ∧ and ∨ operations, for those middle elements.




50. [HM46] Explore the properties of the ∧ and ∨ operations defined in exercise 48. Is it possible to characterize all of the identities in this algebra in some nice way, or to derive them all from a finite set of identities? In this regard, identities such as x∧x∧x = x∧x, or x∧(x∨(z∧(x∨y))) = x∧(x∨y), which hold only for m ≤ 2, are of comparatively little interest; consider only the identities that are true for all m.

► 51. [M25] (R. L. Graham.) The comparator [i:j] is called redundant in the network α_1[i:j]α_2 if either (xα_1)_i ≤ (xα_1)_j for all vectors x, or (xα_1)_i ≥ (xα_1)_j for all vectors x. Prove that if α is a network with r irredundant comparators, there are at least r distinct ordered pairs (i,j) of distinct indices such that (xα)_i ≤ (xα)_j for all vectors x. (Consequently, a network with no redundant comparators contains at most \binom{n}{2} modules.)


Fig. 60. A family of networks whose ability to sort is difficult to verify, illustrated for m = 3 and n = 5. (See exercise 52.)


► 52. [32] (M. O. Rabin, 1980.) Prove that it is intrinsically difficult to decide in general whether a sequence of comparators defines a sorting network, by considering networks of the form sketched in Fig. 60. It is convenient to number the inputs x_0 to x_N, where N = 2mn + m + 2n; the positive integers m and n are parameters. The first comparators are [j:j+2nk] for 1 ≤ j ≤ 2n and 1 ≤ k ≤ m. Then we have [2j−1:2j][0:2j] for 1 ≤ j ≤ n, in parallel with a special subnetwork that uses only indices > 2n. Next we compare [0:2mn+2n+j] for 1 ≤ j ≤ m. And finally there is a complete sorting network for (x_1, …, x_N), followed by [0:1][1:2]…[N−t−1:N−t], where t = mn + n + 1.

a) Describe all inputs (x_0, x_1, …, x_N) that are not sorted by such a network, in terms of the behavior of the special subnetwork.

b) Given a set of clauses such as (y_1 ∨ y_2 ∨ y_3) ∧ (y_2 ∨ y_3 ∨ y_4) ∧ ⋯, explain how to construct a special subnetwork such that Fig. 60 sorts all inputs if and only if the clauses are unsatisfiable. [Hence the task of deciding whether a comparator sequence forms a sorting network is co-NP-complete, in the sense of Section 7.9.]




53. [30] (Periodic sorting networks.) The following two 16-networks illustrate general
recursive constructions of t-level networks for n = 2^t in the case t = 4:

[The two 16-network diagrams, (a) and (b), are not reproduced here.]

If we number the input lines from 0 to 2^t − 1, the lth level in case (a) has comparators
[i:j] where i mod 2^(t+1−l) < 2^(t−l) and j = i ⊕ (2^(t+1−l) − 1); there are t·2^(t−1) comparators
altogether, as in the bitonic merge. In case (b) the first-level comparators are [2j : 2j+1]
for 0 ≤ j < 2^(t−1), and the lth-level comparators for 2 ≤ l ≤ t are [2j+1 : 2j+2^(t+1−l)]
for 0 ≤ j < 2^(t−1) − 2^(t−l); there are (t−1)·2^(t−1) + 1 comparators altogether, as in the
odd-even merge.

If the input numbers are 2^k-ordered in the sense of Theorem 5.2.1H, for some
k ≥ 1, prove that both networks yield outputs that are 2^(k−1)-ordered. Therefore we
can sort 2^t numbers by passing them through either network t times. [When t is large,
these sorting networks use roughly twice as many comparisons as Algorithm 5.2.2M;
but the total delay time is the same as in Fig. 57, and the implementation is simpler
because the same network is used repeatedly.]
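For concreteness, the comparators of network (a) are easy to generate and apply from the formula just given; the following Python sketch is an added illustration (the function names are mine, not from the text):

    def network_a(t):
        # Level l has comparators [i:j] with i mod 2^(t+1-l) < 2^(t-l)
        # and j = i XOR (2^(t+1-l) - 1); t*2^(t-1) comparators in all.
        n = 1 << t
        levels = []
        for l in range(1, t + 1):
            step = 1 << (t + 1 - l)
            levels.append([(i, i ^ (step - 1))
                           for i in range(n) if i % step < step // 2])
        return levels

    def apply_once(x, levels):
        # Pass the list x through all t levels of the network.
        for level in levels:
            for i, j in level:
                if x[i] > x[j]:
                    x[i], x[j] = x[j], x[i]
        return x

Passing a list of 2^t numbers through apply_once t times sorts it, as the exercise asserts.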

54. [42] Study the properties of sorting networks made from m-sorter modules instead
of 2-sorters. (For example, G. Shapiro has constructed a network, not shown here,
which sorts 16 elements using fourteen 4-sorters. Is this the best possible? Prove that
m^2 elements can be sorted with at most 16 levels of m-sorters, when m is sufficiently
large.)

55. [23] A permutation network is a sequence of modules [i_1:j_1] ... [i_r:j_r] where each
module [i:j] can be set by external controls to pass its inputs unchanged or to switch
x_i and x_j (irrespective of the values of x_i and x_j), and such that each permutation
of the inputs is achievable on the output lines by some setting of the modules. Every
sorting network is clearly a permutation network, but the converse is not true: Find a
permutation network for five elements that has only eight modules.

► 56. [25] Suppose the bit vector x ∈ D_n is not sorted. Show that there is a standard
n-network α_x that fails to sort x, although it sorts all other elements of D_n.

57. [M35] The even-odd merge is similar to Batcher's odd-even merge, except that
when mn > 2 it recursively merges the sequence ⟨x_(m mod 2)+1, ..., x_(m−3), x_(m−1)⟩ with
⟨y_1, y_3, ..., y_(2⌈n/2⌉−1)⟩ and ⟨x_(m mod 2)+2, ..., x_(m−2), x_m⟩ with ⟨y_2, y_4, ..., y_(2⌊n/2⌋)⟩ before
making a set of ⌈m/2⌉ + ⌈n/2⌉ − 1 comparison-interchanges analogous to (1).
Show that the even-odd merge achieves the optimum delay time ⌈lg(m+n)⌉ of bitonic
merging, without making more comparisons than the bitonic method. In fact, prove
that the number of comparisons A(m,n) made by even-odd merging satisfies C(m,n) ≤
A(m,n) ≤ ½(m+n) lg min(m,n) + m + ½n.

EXERCISES — Second Set

The following exercises deal with several different types of optimality questions related
to sorting. The first few problems are based on an interesting “multihead” generalization
of the bubble sort, investigated by P. N. Armstrong and R. J. Nelson as early
as 1954. [See U.S. Patents 3029413, 3034102.] Let 1 = h_1 < h_2 < ⋯ < h_m = n be
an increasing sequence of integers; we shall call it a “head sequence” of length m and
span n, and we shall use it to define a special kind of sorting method. The sorting of
records R_1 ... R_N proceeds in several passes, and each pass consists of N + n − 1 steps.
On step j, for j = 1−n, 2−n, ..., N−1, the records R_(j+h[1]), R_(j+h[2]), ..., R_(j+h[m])
are examined and rearranged if necessary so that their keys are in order. (We say
that R_(j+h[1]), ..., R_(j+h[m]) are “under the read-write heads.” When j + h[k] is < 1 or
> N, record R_(j+h[k]) is left out of consideration; in effect, the keys K_0, K_−1, K_−2, ... are
treated as −∞ and K_(N+1), K_(N+2), ... are treated as +∞. Therefore step j is actually
trivial when j < 1 − h[m−1] or j > N − h[2].)

For example, the following table shows one pass of a sort when m = 3, N = 9,
and h_1 = 1, h_2 = 2, h_3 = 4:

             K_1  K_2  K_3  K_4  K_5  K_6  K_7  K_8  K_9
    j = -3:   3    1    4    5    9    2    6    8    7
    j = -2:   3    1    4    5    9    2    6    8    7
    j = -1:   3    1    4    5    9    2    6    8    7
    j =  0:   1    3    4    5    9    2    6    8    7
    j =  1:   1    3    4    5    9    2    6    8    7
    j =  2:   1    3    2    4    9    5    6    8    7
    j =  3:   1    3    2    4    6    5    9    8    7
    j =  4:   1    3    2    4    5    6    9    8    7
    j =  5:   1    3    2    4    5    6    7    8    9
    j =  6:   1    3    2    4    5    6    7    8    9
    j =  7:   1    3    2    4    5    6    7    8    9
    j =  8:   1    3    2    4    5    6    7    8    9

When m = 2, h_1 = 1, and h_2 = 2, this multihead method reduces to the bubble sort
(Algorithm 5.2.2B).
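To make the mechanism concrete, here is a small Python sketch of one pass of the multihead method (the helper name and list representation are mine, not part of the text):

    def multihead_pass(keys, heads):
        # keys holds K[1..N] in a 0-origin Python list; heads is h[1..m],
        # with h[1] = 1 and span n = h[m], as in the definition above.
        N, n = len(keys), heads[-1]
        for j in range(1 - n, N):              # steps j = 1-n, ..., N-1
            # records actually under the heads (those with 1 <= j+h[k] <= N)
            idx = [j + h for h in heads if 1 <= j + h <= N]
            vals = sorted(keys[i - 1] for i in idx)
            for i, v in zip(idx, vals):        # rearrange into key order
                keys[i - 1] = v
        return keys

With keys = [3, 1, 4, 5, 9, 2, 6, 8, 7] and heads = [1, 2, 4], one call reproduces the final line of the table above, [1, 3, 2, 4, 5, 6, 7, 8, 9].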




58. [21] (James Dugundji.) Prove that if h[k+1] = h[k] + 1 for some k, 1 ≤ k < m,
the multihead sorter defined above will eventually sort any input file in a finite number
of passes. But if h[k+1] ≥ h[k] + 2 for 1 ≤ k < m, the input might never become
sorted.

► 59. [30] (Armstrong and Nelson.) Given that h[k+1] ≤ h[k] + k for 1 ≤ k < m, and
N ≥ n − 1, prove that the largest n − 1 elements always move to their final destination
on the first pass. [Hint: Use the zero-one principle; when sorting 0s and 1s, with fewer
than n 1s, prove that it is impossible to have all heads sensing a 1 unless all 0s lie to
the left of the heads.]

Prove that sorting will be complete in at most ⌈(N−1)/(n−1)⌉ passes when the
heads satisfy the given conditions. Is there an input file that requires this many passes?

60. [26] If n = N, prove that the first pass can be guaranteed to place the smallest
key into position R_1 if and only if h[k+1] ≤ 2h[k] for 1 ≤ k < m.

61. [34] (J. Hopcroft.) A “perfect sorter” for N elements is a multihead sorter
with N = n that always finishes in one pass. Exercise 59 proves that the sequence
(h_1, h_2, h_3, h_4, ..., h_m) = (1, 2, 4, 7, ..., 1 + (m choose 2)) gives a perfect sorter for
N = (m choose 2) + 1 elements, using m = (√(8N−7) + 1)/2 heads. For example, the
head sequence (1, 2, 4, 7, 11, 16, 22) is a perfect sorter for 22 elements.

Prove that, in fact, the head sequence (1, 2, 4, 7, 11, 16, 23) is a perfect sorter for
23 elements.

62. [49] Study the largest N for which m-head perfect sorters exist, given m. Is
N = O(m^2)?

63. [23] (V. Pratt.) When each head h_k is in position 2^(k−1) for 1 ≤ k ≤ m, how many
passes are necessary to sort the sequence z_1 z_2 ... z_(2^m−1) of 0s and 1s where z_j = 0 if
and only if j is a power of 2?

64. [24] (Uniform sorting.) The tree of Fig. 34 in Section 5.3.1 makes the comparison
2:3 in both branches on level 1, and on level 2 it compares 1:3 in each branch unless
that comparison would be redundant. In general, we can consider the class of all sorting
algorithms whose comparisons are uniform in that way; assuming that the M = (N choose 2)
pairs {(a,b) | 1 ≤ a < b ≤ N} have been arranged into a sequence

    (a_1, b_1), (a_2, b_2), ..., (a_M, b_M),

we can successively make each of the comparisons K_(a_1):K_(b_1), K_(a_2):K_(b_2), ... whose
outcome is not already known. Each of the M! arrangements of the (a,b) pairs defines a
uniform sorting algorithm. The concept of uniform sorting is due to H. L. Beus [JACM
17 (1970), 482-495], whose work has suggested the next few exercises.

It is convenient to define uniform sorting formally by means of graph theory. Let
G be the directed graph on the vertices {1, 2, ..., N} having no arcs. For i = 1, 2,
..., M we add arcs to G as follows:

Case 1. G contains a path from a_i to b_i. Add the arc a_i → b_i to G.

Case 2. G contains a path from b_i to a_i. Add the arc b_i → a_i to G.

Case 3. G contains no path from a_i to b_i or b_i to a_i. Compare K_(a_i):K_(b_i); then add

the arc a_i → b_i to G if K_(a_i) < K_(b_i), the arc b_i → a_i if K_(a_i) > K_(b_i).

We are concerned primarily with the number of key comparisons made by a uniform 
sorting algorithm, not with the mechanism by which redundant comparisons are ac- 
tually avoided. Thus the graph G need not be constructed explicitly; it is used here 
merely to help define the concept of uniform sorting. 
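As an illustration only (the text gives no program, and the graph need not really be built), the three cases can be transcribed directly into Python, with reachable() a naive depth-first search:

    def reachable(adj, u, v):
        # Is there a directed path from u to v in the arc lists adj?
        seen, stack = set(), [u]
        while stack:
            w = stack.pop()
            if w == v:
                return True
            if w not in seen:
                seen.add(w)
                stack.extend(adj[w])
        return False

    def uniform_sort_comparisons(keys, pairs):
        # keys[a-1] is K_a; pairs is the sequence (a_1,b_1), ..., (a_M,b_M).
        adj = {v: set() for v in range(1, len(keys) + 1)}
        comparisons = 0
        for a, b in pairs:
            if reachable(adj, a, b):          # Case 1
                adj[a].add(b)
            elif reachable(adj, b, a):        # Case 2
                adj[b].add(a)
            else:                             # Case 3: a real comparison
                comparisons += 1
                if keys[a - 1] < keys[b - 1]:
                    adj[a].add(b)
                else:
                    adj[b].add(a)
        return comparisons

For distinct keys and the lexicographic pair sequence of exercise 64 below, this counts exactly the irredundant comparisons of quicksort, per that exercise.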




We shall also consider restricted uniform sorting, in which only paths of length 2
are counted in cases 1, 2, and 3 above. (A restricted uniform sorting algorithm may
make some redundant comparisons, but exercise 65 shows that the analysis is somewhat
simpler in the restricted case.)

Prove that the restricted uniform algorithm is the same as the uniform algorithm
when the sequence of pairs is taken in lexicographic order

    (1,2)(1,3)(1,4) ... (1,N)(2,3)(2,4) ... (2,N) ... (N−1,N).

Show in fact that both algorithms are equivalent to quicksort (Algorithm 5.2.2Q) when
the keys are distinct and when quicksort's redundant comparisons are removed as in
exercise 5.2.2-24. (Disregard the order in which the comparisons are actually made in
quicksort; consider only which pairs of keys are compared.)

65. [M38] Given a pair sequence (a_1,b_1) ... (a_M,b_M) as in exercise 64, let c_i be the
number of pairs (j,k) such that j < k < i and (a_i,b_i), (a_j,b_j), (a_k,b_k) forms a triangle.

a) Prove that the average number of comparisons made by the restricted uniform
sorting algorithm is ∑_(1≤i≤M) 2/(c_i + 2).

b) Use the results of (a) and exercise 64 to determine the average number of irredundant
comparisons performed by quicksort.

c) The following pair sequence is inspired by (but not equivalent to) merge sorting:

    (1,2)(3,4)(5,6) ... (1,3)(1,4)(2,3)(2,4)(5,7) ... (1,5)(1,6)(1,7)(1,8)(2,5) ...

Does the uniform method based on this sequence do more or fewer comparisons
than quicksort, on the average?

66. [M29] In the worst case, quicksort does (N choose 2) comparisons. Do all restricted
uniform sorting algorithms (in the sense of exercise 64) perform (N choose 2) comparisons in
their worst case?

67. [M48] (H. L. Beus.) Does quicksort have the minimum average number of comparisons,
over all (restricted) uniform sorting algorithms?

68. [25] The Ph.D. thesis “Electronic Data Sorting” by Howard B. Demuth (Stanford 
University, October 1956) was perhaps the first publication to deal in any detail with 
questions of computational complexity. Demuth considered several abstract models 
for sorting devices, and established lower and upper bounds on the mean and maxi- 
mum execution times achievable with each model. His simplest model, the “circular 
nonreversible memory” (Fig. 61), is the subject of this exercise. 



Fig. 61. A device for which the bubble-sort strategy is optimum. 






Consider a machine that sorts R_1 R_2 ... R_N in a number of passes, where each
pass contains the following N + 1 steps:

Step 1. Set R ← R_1. (R is an internal machine register.)

Step i, for 1 < i ≤ N. Either (i) set R_(i−1) ← R, R ← R_i, or (ii) set R_(i−1) ← R_i,
leaving R unchanged.

Step N + 1. Set R_N ← R.

The problem is to find a way to choose between alternatives (i) and (ii) each time, in
order to minimize the number of passes required to sort.

Prove that the “bubble sort” technique is optimum for this model. In other words,
show that the strategy that selects alternative (i) whenever R ≤ R_i and alternative (ii)
whenever R > R_i will achieve the minimum number of passes.
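A minimal Python sketch of one pass over this device, using the bubble-sort strategy in question (an illustration with a 0-origin list, not Demuth's formulation):

    def circular_pass(R):
        r = R[0]                         # Step 1: register <- R_1
        for i in range(1, len(R)):       # Steps 2..N
            if r <= R[i]:                # alternative (i): emit r, carry R_i
                R[i - 1], r = r, R[i]
            else:                        # alternative (ii): R_i passes through
                R[i - 1] = R[i]
        R[-1] = r                        # Step N+1
        return R

Each pass carries the largest remaining element forward, exactly as in one bubble-sort pass.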


They that weave networks shall be confounded. 

— Isaiah 19:9 




5.4. EXTERNAL SORTING 

Now it is time for us to study the interesting problems that arise when the
number of records to be sorted is larger than our computer can hold in its 
high-speed internal memory. External sorting is quite different from internal 
sorting, even though the problem in both cases is to sort a given file into 
nondecreasing order, since efficient storage accessing on external files is rather 
severely limited. The data structures must be arranged so that comparatively 
slow peripheral memory devices (tapes, disks, drums, etc.) can quickly cope with 
the requirements of the sorting algorithm. Consequently most of the internal 
sorting techniques we have studied (insertion, exchange, selection) are virtually 
useless for external sorting, and it is necessary to reconsider the whole question. 

Suppose, for example, that we are supposed to sort a file of five million
records R1 R2 ... R5000000, and that each record R_i is 20 words long (although
the keys K_i are not necessarily this long). If only one million of these records
will fit in the internal memory of our computer at one time, what shall we do?

One fairly obvious solution is to start by sorting each of the five subfiles
R1 ... R1000000, R1000001 ... R2000000, ..., R4000001 ... R5000000 independently,
then to merge the resulting subfiles together. Fortunately the process of merging
uses only very simple data structures, namely linear lists that are traversed in
a sequential manner as stacks or as queues; hence merging can be done without
difficulty on the least expensive external memory devices.

The process just described — internal sorting followed by external merging — 
is very commonly used, and we shall devote most of our study of external sorting 
to variations on this theme. 

The ascending sequences of records that are produced by the initial internal 
sorting phase are often called strings in the published literature about sorting; 
this terminology is fairly widespread, but it unfortunately conflicts with even 
more widespread usage in other branches of computer science, where “strings” 
are arbitrary sequences of symbols. Our study of permutations has already given 
us a perfectly good name for the sorted segments of a file, which are convention- 
ally called ascending runs or simply runs. Therefore we shall consistently use 
the word “runs” to describe sorted portions of a file. In this way it is possible to 
distinguish between “strings of runs” and “runs of strings” without ambiguity. 
(Of course, “runs of a program” means something else again; we can’t have 
everything.) 

Let us consider first the process of external sorting when magnetic tapes 
are used for auxiliary storage. Perhaps the simplest and most appealing way to 
merge with tapes is the balanced two-way merge following the central idea that 
was used in Algorithms 5.2.4N, S, and L. We use four “working tapes” in this
process. During the first phase, ascending runs produced by internal sorting are 
placed alternately on Tapes 1 and 2 , until the input is exhausted. Then Tapes 1 
and 2 are rewound to their beginnings, and we merge the runs from these tapes, 
obtaining new runs that are twice as long as the original ones; the new runs 
are written alternately on Tapes 3 and 4 as they are being formed. (If Tape 
1 contains one more run than Tape 2 , an extra “dummy” run of length 0 is 




assumed to be present on Tape 2.) Then all tapes are rewound, and the contents
of Tapes 3 and 4 are merged into quadruple-length runs recorded alternately on
Tapes 1 and 2. The process continues, doubling the length of runs each time,
until only one run is left (namely the entire sorted file). If S runs were produced
during the internal sorting phase, and if 2^(k−1) < S ≤ 2^k, this balanced two-way
merge procedure makes exactly k = ⌈lg S⌉ merging passes over all the data.

For example, in the situation above where 5000000 records are to be sorted
with an internal memory capacity of 1000000, we have S = 5. The initial
distribution phase of the sorting process places five runs on tape as follows:

    Tape 1   R1 ... R1000000; R2000001 ... R3000000; R4000001 ... R5000000.
    Tape 2   R1000001 ... R2000000; R3000001 ... R4000000.                   (1)
    Tape 3   (empty)
    Tape 4   (empty)

The first pass of merging then produces longer runs on Tapes 3 and 4, as it reads
Tapes 1 and 2, as follows:

    Tape 3   R1 ... R2000000; R4000001 ... R5000000.
    Tape 4   R2000001 ... R4000000.                                          (2)

(A dummy run has implicitly been added at the end of Tape 2, so that the last
run R4000001 ... R5000000 on Tape 1 is merely copied onto Tape 3.) After all tapes
are rewound, the next pass over the data produces
are rewound, the next pass over the data produces 

    Tape 1   R1 ... R4000000.
    Tape 2   R4000001 ... R5000000.                                          (3)

(Again that run R4000001 ... R5000000 was simply copied; but if we had started
with 8000000 records, Tape 2 would have contained R4000001 ... R8000000 at this
point.) Finally, after another spell of rewinding, R1 ... R5000000 is produced on
Tape 3, and the sorting is complete.

Balanced merging can easily be generalized to the case of T tapes, for any
T ≥ 3. Choose any number P with 1 ≤ P < T, and divide the T tapes into two
“banks,” with P tapes on the left bank and T − P on the right. Distribute the
initial runs as evenly as possible onto the P tapes in the left bank; then do a
P-way merge from the left to the right, followed by a (T−P)-way merge from
the right to the left, etc., until sorting is complete. The best choice of P usually
turns out to be ⌈T/2⌉ (see exercises 3 and 4).
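The pass counts are easy to tabulate with a few lines of code; the following Python sketch is mine, under the idealized assumptions above (dummy runs supplied as needed, and the initial distribution not counted as a pass):

    import math

    def balanced_passes(S, T, P):
        passes, runs = 0, S
        fan = (P, T - P)              # left-to-right, then right-to-left
        while runs > 1:
            runs = math.ceil(runs / fan[passes % 2])
            passes += 1
        return passes

For S = 5 it yields balanced_passes(5, 4, 2) == 3 and balanced_passes(5, 6, 3) == 2, in agreement with the examples in the text.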

Balanced two-way merging is the special case T = 4, P = 2. Let us
reconsider the example above using more tapes, taking T = 6 and P = 3. The
initial distribution now gives us

    Tape 1   R1 ... R1000000; R3000001 ... R4000000.
    Tape 2   R1000001 ... R2000000; R4000001 ... R5000000.                   (4)
    Tape 3   R2000001 ... R3000000.






And the first merging pass produces

    Tape 4   R1 ... R3000000.
    Tape 5   R3000001 ... R5000000.                                          (5)
    Tape 6   (empty)

(A dummy run has been assumed on Tape 3.) The second merging pass completes
the job, placing R1 ... R5000000 on Tape 1. In this special case T = 6 is essentially
the same as T = 5, since the sixth tape is used only when S ≥ 7.

Three-way merging requires more computer processing than two-way merging;
but this is generally negligible compared to the cost of reading, writing,
and rewinding the tapes. We can get a fairly good estimate of the running time
by considering only the amount of tape motion. The example in (4) and (5)
required only two passes over the data, compared to three passes when T = 4,
so the merging takes only about two-thirds as long when T = 6.

Balanced merging is quite simple, but if we look more closely, we find
immediately that it isn't the best way to handle the particular cases treated
above. Instead of going from (1) to (2) and rewinding all of the tapes, we should
have stopped the first merging pass after Tapes 3 and 4 contained R1 ... R2000000
and R2000001 ... R4000000, respectively, with Tape 1 poised ready to read the
records R4000001 ... R5000000. Then Tapes 2, 3, 4 could be rewound and we could
complete the sort by doing a three-way merge onto Tape 2. The total number of
records read from tape during this procedure would be only 4000000 + 5000000 =
9000000, compared to 5000000 + 5000000 + 5000000 = 15000000 in the balanced
scheme. A smart computer would be able to figure this out.

Indeed, when we have five runs and four tapes we can do even better by
distributing them as follows:

    Tape 1   R1 ... R1000000; R3000001 ... R4000000.
    Tape 2   R1000001 ... R2000000; R4000001 ... R5000000.
    Tape 3   R2000001 ... R3000000.
    Tape 4   (empty)

Then a three-way merge to Tape 4, followed by a rewind of Tapes 3 and 4, 
followed by a three-way merge to Tape 3, would complete the sort with only 
3000000 + 5000000 = 8000000 records read. 

And, of course, if we had six tapes we could put the initial runs on Tapes 1 
through 5 and complete the sort in one pass by doing a five-way merge to Tape 6.
These considerations indicate that simple balanced merging isn’t the best, and 
it is interesting to look for improved merging patterns. 

Subsequent portions of this chapter investigate external sorting more deeply. 
In Section 5.4.1, we will consider the internal sorting phase that produces the 
initial runs; of particular interest is the technique of “replacement selection,” 
which takes advantage of the order present in most data to produce long initial 
runs that actually exceed the internal memory capacity by a significant amount. 
Section 5.4.1 also discusses a suitable data structure for multiway merging. 




The most important merging patterns are discussed in Sections 5.4.2 through 
5.4.5. It is convenient to have a rather naive conception of tape sorting as we 
learn the characteristics of these patterns, before we come to grips with the 
harsh realities of real tape drives and real data to be sorted. For example, we 
may blithely assume (as we did above) that the original input records appear 
magically during the initial distribution phase; in fact, these input records might 
well occupy one of our tapes, and they may even fill several tape reels since 
tapes aren’t of infinite length! It is best to ignore such mundane considerations 
until after an academic understanding of the classical merging patterns has been 
gained. Then Section 5.4.6 brings the discussion down to earth by discussing 
real-life constraints that strongly influence the choice of a pattern. Section 5.4.6 
compares the basic merging patterns of Sections 5.4.2 through 5.4.5, using a 
variety of assumptions that arise in practice. 

Some other approaches to external sorting, not based on merging, are dis- 
cussed in Sections 5.4.7 and 5.4.8. Finally Section 5.4.9 completes our survey of 
external sorting by treating the important problem of sorting on bulk memories 
such as disks and drums. 

When this book was first written, magnetic tapes were abundant and disk 
drives were expensive. But disks became enormously better during the 1980s, 
and by the late 1990s they had almost completely replaced magnetic tape units 
on most of the world’s computer systems. Therefore the once-crucial topic of 
patterns for tape merging has become of limited relevance to current needs. 

Yet many of the patterns are quite beautiful, and the associated algorithms 
reflect some of the best research done in computer science during its early years; 
the techniques are just too nice to be discarded abruptly onto the rubbish heap 
of history. Indeed, the ways in which these methods blend theory with practice 
are especially instructive. Therefore merging patterns are discussed carefully 
and completely below, in what may be their last grand appearance before they 
accept a final curtain call. 

For all we know now, 
these techniques may well become crucial once again. 

— PAVEL CURTIS (1997) 

EXERCISES 

1. [15] The text suggests internal sorting first, followed by external merging. Why 
don’t we do away with the internal sorting phase, simply merging the records into 
longer and longer runs right from the start? 

2. [10] What will the sequence of tape contents be, analogous to (1) through (3),
when the example records R1 R2 ... R5000000 are sorted using a 3-tape balanced method
with P = 2? Compare this to the 4-tape merge; how many passes are made over all
the data, after the initial distribution of runs?

3. [20] Show that the balanced (P, T−P)-way merge applied to S initial runs takes
2k passes, when P^k (T−P)^(k−1) < S ≤ P^k (T−P)^k; and it takes 2k + 1 passes, when
P^k (T−P)^k < S ≤ P^(k+1) (T−P)^k.

Give simple formulas for (a) the exact number of passes, as a function of S, when
T = 2P; and (b) the approximate number of passes, as S → ∞, for general P and T.

4. [HM15] What value of P, for 1 ≤ P < T, makes P(T − P) a maximum?




5.4.1. Multiway Merging and Replacement Selection 

In Section 5.2.4, we studied internal sorting methods based on two-way merging, 
the process of combining two ordered sequences into a single ordered sequence. 
It is not difficult to extend this to the notion of P-way merging, where P runs
of input are combined into a single run of output. 

Let’s assume that we have been given P ascending runs, that is, sequences 
of records whose keys are in nondecreasing order. The obvious way to merge 
them is to look at the first record of each run and to select the record whose 
key is smallest; this record is transferred to the output and removed from the 
input, and the process is repeated. At any given time we need to look at only P 
keys (one from each input run) and select the smallest. If two or more keys are 
smallest, an arbitrary one is selected. 

When P isn’t too large, it is convenient to make this selection by simply 
doing P − 1 comparisons to find the smallest of the current keys. But when
P is, say, 8 or more, we can save work by using a selection tree as described in 
Section 5.2.3; then only about lg P comparisons are needed each time, once the 
tree has been set up. 
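The whole process is easy to express with a priority queue; here is a brief Python sketch (an illustration of the idea, not a program from the text), using a heap as the selection structure — the text notes below that either a heap or a selection tree may play this role:

    import heapq

    def p_way_merge(runs):
        # runs: a list of P ascending lists; returns the merged output.
        heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
        heapq.heapify(heap)
        out = []
        while heap:
            key, i, j = heapq.heappop(heap)        # smallest current key
            out.append(key)
            if j + 1 < len(runs[i]):               # replace by its successor
                heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
        return out

For the four-way example that follows, p_way_merge([[87, 503], [170, 908], [154, 426, 653], [612]]) produces 87, 154, 170, 426, 503, 612, 653, 908.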

Consider, for example, the case of four-way merging, with a two-level selection
tree; the four input runs, each terminated by an ∞ key, are

    ⟨087, 503, ∞⟩,  ⟨170, 908, ∞⟩,  ⟨154, 426, 653, ∞⟩,  ⟨612, ∞⟩.

[The selection-tree diagrams for the individual steps are not reproduced here.]
The successive outputs are

    Step 1.  087
    Step 2.  087 154
    Step 3.  087 154 170
       ⋮
    Step 9.  087 154 170 426 503 612 653 908 ∞

An additional key “∞” has been placed at the end of each run in this example,
so that the merging terminates gracefully. Since external merging generally
deals with very long runs, the addition of records with ∞ keys does not add
substantially to the length of the data or to the amount of work involved in
merging, and such sentinel records frequently serve as a useful way to delimit
the runs on a file.





Fig. 62. A tournament to select the smallest key, using a complete binary tree 
whose nodes are numbered from 1 to 23. 

Each step after the first in this process consists of replacing the smallest 
element by the succeeding element in its run, and changing the corresponding 
path in the selection tree. Thus the three positions of the tree that contain 087 
in Step 1 are changed in Step 2; the three positions containing 154 in Step 2 are 
changed in Step 3; and so on. The process of replacing one key by another in 
the selection tree is called replacement selection. 

We can look at this four-way merge in several ways. From one standpoint it 
is equivalent to three two-way merges performed concurrently as coroutines; each 
node in the selection tree represents one of the sequences involved in concurrent 
merging processes. The selection tree is also essentially operating as a priority 
queue, with a smallest-in-first-out discipline. 

As in Section 5.2.3 we could implement the priority queue by using a heap 
instead of a selection tree. (The heap would, of course, be arranged so that the 
smallest element appears at the top, instead of the largest, reversing the order of 
Eq. 5.2.3-(3).) Since a heap does not have a fixed size, we could therefore avoid 
the use of oo keys; merging would be complete when the heap becomes empty. 
On the other hand, external sorting applications usually deal with comparatively 
long records and keys, so that the heap is filled with pointers to keys instead of 
the keys themselves; we shall see below that selection trees can be represented by 
pointers in such a convenient manner that they are probably superior to heaps 
in this situation. 

A tree of losers. Figure 62 shows the complete binary tree with 12 external 
(rectangular) nodes and 11 internal (circular) nodes. The external nodes have 
been filled with keys, and the internal nodes have been filled with the “winners,” 
if the tree is regarded as a tournament to select the smallest key. The smaller 
numbers above each node show the traditional way to allocate consecutive stor- 
age positions for complete binary trees. 





Fig. 63. The same tournament as Fig. 62, but showing the losers instead of the 
winners; the champion appears at the very top. 

When the smallest key, 061, is to be replaced by another key in the selection 
tree of Fig. 62, we will have to look at the keys 512, 087, and 154, and no 
other existing keys, in order to determine the new state of the selection tree. 
Considering the tree as a tournament, these three keys are the losers in the 
matches played by 061. This suggests that the loser of a match should actually 
be stored in each internal node of the tree, instead of the winner; then the 
information required for updating the tree will be readily available. 

Figure 63 shows the same tree as Fig. 62, but with the losers represented 
instead of the winners. An extra node number 0 has been appended at the top 
of the tree, to indicate the champion of the tournament. Each key except the 
champion is a loser exactly once (see Section 5.3.3), so each key appears just 
once in an external node and once in an internal node. 

In practice, the external nodes at the bottom of Fig. 63 will represent fairly 
long records stored in computer memory, and the internal nodes will represent 
pointers to those records. Note that P-way merging calls for exactly P external
nodes and P internal nodes, each in consecutive positions of memory, hence 
several efficient methods of storage allocation suggest themselves. It is not 
difficult to see how to use a loser-oriented tree for replacement selection; we 
shall discuss the details later. 

Initial runs by replacement selection. The technique of replacement se- 
lection can be used also in the first phase of external sorting, if we essentially 
do a P-way merge of the input data with itself! In this case we take P to be
quite large, so that the internal memory is essentially filled. When a record is 
output, it is replaced by the next record from the input. If the new record has a 
smaller key than the one just output, we cannot include it in the current run; but 






Table 1
EXAMPLE OF FOUR-WAY REPLACEMENT SELECTION

        Memory contents              Output
     503   087   512   061            061
     503   087   512   908            087
     503   170   512   908            170
     503   897   512   908            503
    (275)  897   512   908            512
    (275)  897   653   908            653
    (275)  897  (426)  908            897
    (275) (154) (426)  908            908
    (275) (154) (426) (509)       (end of run)
     275   154   426   509            154
     275   612   426   509            275
                  etc.


otherwise we can enter it into the selection tree in the usual way and it will form 
part of the run currently being produced. Thus the runs can contain more than 
P records each, even though we never have more than P in the selection tree at 
any time. Table 1 illustrates this process for P = 4; parenthesized numbers are
waiting for inclusion in the following run. 

This important method of forming initial runs was first described by Harold
H. Seward [Master's Thesis, Digital Computer Laboratory Report R-232
(Mass. Inst. of Technology, 1954), 29-30], who gave reason to believe that the
runs would contain more than 1.5P records when applied to random data. A. I.
Dumey had also suggested the idea about 1950 in connection with a special sorting
device planned by Engineering Research Associates, but he did not publish it.
The name “replacement selecting” was coined by E. H. Friend [JACM 3 (1956),
154], who remarked that “the expected length of the sequences produced eludes
formulation but experiment suggests that 2P is a reasonable expectation.”

A clever way to show that 2P is indeed the expected run length was discovered
by E. F. Moore, who compared the situation to a snowplow on a circular
track [U.S. Patent 2983904 (1961), columns 3-4]. Consider the situation shown
in Fig. 64: Flakes of snow are falling uniformly on a circular road, and a lone 
snowplow is continually clearing the snow. Once the snow has been plowed off 
the road, it disappears from the system. Points on the road may be designated by 
real numbers x, 0 < x < 1; a flake of snow falling at position x represents an input 
record whose key is x, and the snowplow represents the output of replacement 
selection. The ground speed of the snowplow is inversely proportional to the 
height of snow it encounters, and the situation is perfectly balanced so that the 
total amount of snow on the road at all times is exactly P. A new run is formed 
in the output whenever the plow passes point 0. 

After this system has been in operation for awhile, it is intuitively clear that 
it will approach a stable situation in which the snowplow runs at constant speed 
(because of the circular symmetry of the track). This means that the snow is at 





Fig. 64. The perpetual plow on its ceaseless cycle. 

constant height when it meets the plow, and the height drops off linearly in front 
of the plow as shown in Fig. 65. It follows that the volume of snow removed in 
one revolution (namely the run length) is twice the amount present at any one 
time (namely P). 





Fig. 65. Cross-section, showing the varying height of snow in front of the plow when 
the system is in its steady state. 

In many commercial applications the input data is not completely random; 
it already has a certain amount of existing order. Therefore the runs produced by 
replacement selection will tend to contain even more than 2P records. We shall
see that the time required for external merge sorting is largely governed by the
number of runs produced by the initial distribution phase, so that replacement
selection becomes especially desirable; other types of internal sorting would produce
about twice as many initial runs because of the limitations on memory size.

Let us now consider the process of creating initial runs by replacement 
selection in detail. The following algorithm is due to John R. Walters, James 
Painter, and Martin Zalk, who used it in a merge-sort program for the Philco 
2000 in 1958. It incorporates a rather nice way to initialize the selection tree 
and to distinguish records belonging to different runs, as well as to flush out the 
last run, with comparatively simple and uniform logic. (The proper handling 
of the last run produced by replacement selection turns out to be a bit tricky, 





Fig. 66. Making initial runs by replacement selection. 

and it has tended to be a stumbling block for programmers.) The principal idea
is to consider each key as a pair (S, K), where K is the original key and S is
the run number to which this record belongs. When such extended keys are
lexicographically ordered, with S as major key and K as minor key, we obtain
the output sequence produced by replacement selection.

The algorithm below uses a data structure containing P nodes to represent
the selection tree; the jth node X[j] is assumed to contain c words beginning
in LOC(X[j]) = L0 + cj, for 0 ≤ j < P, and it represents both internal node
number j and external node number P + j in Fig. 63. There are several named
fields in each node:

KEY = the key stored in this external node;

RECORD = the record stored in this external node (including KEY as a subfield);

LOSER = pointer to the “loser” stored in this internal node;

RN = run number of the record stored in this external node;

PE = pointer to internal node above this external node in the tree;

PI = pointer to internal node above this internal node in the tree.

For example, when P = 12, internal node number 5 and external node number 17
of Fig. 63 would both be represented in X[5], by the fields KEY = 170, LOSER =
L0 + 9c (the address of external node number 21), PE = L0 + 8c, PI = L0 + 2c.

The PE and PI fields have constant values, so they need not appear explicitly
in memory; however, the initial phase of external sorting sometimes has trouble
keeping up with the I/O devices, and it might be worthwhile to store these
redundant values with the data instead of recomputing them each time.

Algorithm R (Replacement selection). This algorithm reads records sequentially
from an input file and writes them sequentially onto an output file, producing
RMAX runs whose length is P or more (except for the final run). There
are P ≥ 2 nodes, X[0], ..., X[P−1], having fields as described above.

R1. [Initialize.] Set RMAX ← 0, RC ← 0, LASTKEY ← ∞, and Q ← LOC(X[0]).
(Here RC is the number of the current run and LASTKEY is the key of the
last record output. The initial setting of LASTKEY should be larger than any
possible key; see exercise 8.) For 0 ≤ j < P, set the initial contents of X[j]
as follows:

    J ← LOC(X[j]);  LOSER(J) ← J;  RN(J) ← 0;
    PE(J) ← LOC(X[⌊(P+j)/2⌋]);  PI(J) ← LOC(X[⌊j/2⌋]).

(The settings of LOSER(J) and RN(J) are artificial ways to get the tree
initialized by considering a fictitious run number 0 that is never output.
This is tricky; see exercise 10.)

R2. [End of run?] If RN(Q) = RC, go on to step R3. (Otherwise RN(Q) = RC + 1
and we have just completed run number RC; any special actions required by
a merging pattern for subsequent passes of the sort would be done at this
point.) If RC = RMAX, stop; otherwise set RC ← RC + 1.

R3. [Output top of tree.] (Now Q points to the “champion,” and RN(Q) = RC.)
If RC ≠ 0, output RECORD(Q) and set LASTKEY ← KEY(Q).

R4. [Input new record.] If the input file is exhausted, set RN(Q) ← RMAX + 1
and go on to step R5. Otherwise set RECORD(Q) to the next record from the
input file. If KEY(Q) < LASTKEY (so that this new record does not belong to
the current run), set RMAX ← RN(Q) ← RC + 1.

R5. [Prepare to update.] (Now Q points to a new record.) Set T ← PE(Q).
(Variable T is a pointer that will move up the tree.)

R6. [Set new loser.] Set L ← LOSER(T). If RN(L) < RN(Q) or if RN(L) = RN(Q)
and KEY(L) < KEY(Q), then set LOSER(T) ← Q and Q ← L. (Variable Q
keeps track of the current winner.)

R7. [Move up.] If T = LOC(X[1]) then go back to R2; otherwise set T ← PI(T)
and return to R6. |

Algorithm R speaks of input and output of records one at a time, while in 
practice it is best to read and write relatively large blocks of records. Therefore 
some input and output buffers are actually present in memory, behind the scenes, 
effectively lowering the size of P. We shall illustrate this in Section 5.4.6. 
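For readers who want to experiment, here is a fairly literal Python transcription of Algorithm R — a sketch under simplifying assumptions: records are bare numeric keys, float('inf') plays the role of ∞, and array indices replace the pointer fields, so PE and PI become the index computations (P+q)//2 and t//2 from step R1:

    def replacement_selection(read_record, write_record, P):
        INF = float('inf')
        key = [INF] * P                 # KEY(X[j])
        rn = [0] * P                    # RN(X[j]); run 0 is fictitious
        loser = list(range(P))          # LOSER(X[j]), artificially set
        rmax = rc = 0
        lastkey = INF
        q = 0                           # Q, the current winner
        while True:
            if rn[q] != rc:             # R2: end of run?
                if rc == rmax:
                    return
                rc += 1
            if rc != 0:                 # R3: output top of tree
                write_record(key[q])
                lastkey = key[q]
            k = read_record()           # R4: input new record (None = end)
            if k is None:
                rn[q] = rmax + 1
            else:
                key[q] = k
                if k < lastkey:         # belongs to the next run
                    rmax = rn[q] = rc + 1
            t = (P + q) // 2            # R5: T <- PE(Q)
            while t > 0:                # R6 and R7: replay the matches
                if (rn[loser[t]] < rn[q] or
                        (rn[loser[t]] == rn[q] and key[loser[t]] < key[q])):
                    loser[t], q = q, loser[t]
                t //= 2                 # T <- PI(T)

Driving it with the inputs of Table 1,

    data = iter([503, 87, 512, 61, 908, 170, 897, 275,
                 653, 426, 154, 509, 612])
    out = []
    replacement_selection(lambda: next(data, None), out.append, 4)

reproduces the first run 61, 87, 170, 503, 512, 653, 897, 908 shown there.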

*Delayed reconstitution of runs. A very interesting way to improve on
replacement selection has been suggested by R. J. Dinsmore [CACM 8 (1965),
48], using a concept that we shall call degrees of freedom. As we have seen,
each block of records on tape within a run is in nondecreasing order, so that its
first element is the lowest and its last element is the highest. In the ordinary
process of replacement selection, the lowest element of each block within a run
is never less than the highest element of the preceding block in that run; this is
“1 degree of freedom.” Dinsmore suggests relaxing this condition to “m degrees
of freedom,” where the lowest element of each block may be less than the highest
element of the preceding block so long as it is not less than the highest elements
in m different preceding blocks of the same run. Records within individual blocks
are ordered, as before, but adjacent blocks need not be in order.




For example, suppose that there are just two records per block; the following
sequence of blocks is a run with three degrees of freedom:

    | 08 50 | 06 90 | 17 27 | 42 67 | 51 89 |                                (1)

A subsequent block that is to be part of the same run must begin with an
element not less than the third largest element of {50, 90, 27, 67, 89}, namely 67.
The sequence (1) would not be a run if there were only two degrees of freedom,
since 17 is less than both 50 and 90.

A run with m degrees of freedom can be “reconstituted” while it is being
read during the next phase of sorting, so that for all practical purposes it is a run
in the ordinary sense. We start by reading the first m blocks into m buffers, and
doing an m-way merge on them; when one buffer is exhausted, we replace it with
the (m+1)st block, and so on. In this way we can recover the run as a single
sequence, for the first word of every newly read block must be greater than or
equal to the last word of the just-exhausted block (lest it be less than the highest
elements in m different blocks that precede it). This method of reconstituting
the run is essentially like an m-way merge using a single tape unit for all the
input blocks! The reconstitution procedure acts as a coroutine that is called
upon to deliver one record of the run at a time. We could be reconstituting
different runs from different tape units with different degrees of freedom, and
merging the resulting runs, all at the same time, in essentially the same way as
the four-way merge illustrated at the beginning of this section may be thought
of as several two-way merges going on at once.
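The reconstitution coroutine is naturally written as a generator; this Python sketch (names mine; a list of blocks stands in for the single input tape) performs the m-way merge and refills whichever buffer runs dry:

    import heapq

    def reconstitute(blocks, m):
        buffers = [iter(b) for b in blocks[:m]]
        heap = [(next(it), i) for i, it in enumerate(buffers)]
        heapq.heapify(heap)
        nxt = m                              # next block to read from tape
        while heap:
            x, i = heapq.heappop(heap)
            yield x                          # deliver one record of the run
            try:
                heapq.heappush(heap, (next(buffers[i]), i))
            except StopIteration:            # buffer i exhausted: refill it
                if nxt < len(blocks):
                    buffers[i] = iter(blocks[nxt])
                    nxt += 1
                    heapq.heappush(heap, (next(buffers[i]), i))

Here list(reconstitute([[8, 50], [6, 90], [17, 27], [42, 67], [51, 89]], 3)) yields the run (1) above in fully sorted order.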

This ingenious idea is difficult to analyze precisely, but T. O. Espelid has
shown how to extend the snowplow analogy to obtain an approximate formula
for the behavior [BIT 16 (1976), 133-142]. According to his approximation,
which agrees well with empirical tests, the run length will be about

    2P + (m − 1.5) ( (2P + (m−2)b) / (2P + (2m−3)b) ) b,

when b is the block size and m ≥ 2. Such an increase may not be enough to
justify the added complication; on the other hand, it may be advantageous when
there is room for a rather large number of buffers during the second phase of
sorting.


*Natural selection. Another way to increase the run lengths produced by
replacement selection has been explored by W. D. Frazer and C. K. Wong [CACM
15 (1972), 910-913]. Their idea is to proceed as in Algorithm R, except that
a new record is not placed in the tree when its key is less than LASTKEY; it is
output into an external reservoir instead, and another new record is read in. This
process continues until the reservoir is filled with a certain number of records, P′;
then the remainder of the current run is output from the tree, and the reservoir
items are used as input for the next run.
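In outline (a sketch with invented helper names, not Frazer and Wong's program, and with a heap standing in for the selection tree):

    import heapq

    def natural_selection_runs(stream, P, P_reservoir):
        it = iter(stream)
        tree = []
        for _ in range(P):                        # fill internal memory
            try:
                tree.append(next(it))
            except StopIteration:
                break
        heapq.heapify(tree)
        while tree:
            run, reservoir = [], []
            while tree and len(reservoir) < P_reservoir:
                lastkey = heapq.heappop(tree)     # output the smallest key
                run.append(lastkey)
                try:
                    k = next(it)
                except StopIteration:
                    continue
                if k < lastkey:
                    reservoir.append(k)           # “dead”: save for next run
                else:
                    heapq.heappush(tree, k)
            run.extend(sorted(tree))              # reservoir full: flush run
            tree = reservoir                      # reservoir seeds next run
            heapq.heapify(tree)
            yield run

With P_reservoir = P and uniform random input, the average of len(run) over many runs should come out near eP ≈ 2.718P, matching the observation cited below.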

The use of a reservoir tends to produce longer runs than replacement selec- 
tion, because it reroutes the “dead” records that belong to the next run instead 
of letting them clutter up the tree; but it requires extra time for input and output 







to and from the reservoir. When P′ > P it is possible that some records will be
placed into the reservoir twice, but when P′ < P this will never happen.

Frazer and Wong made extensive empirical tests of their method, noticing
that when P is reasonably large (say P ≥ 32) and P′ = P the average run
length for random data is approximately given by eP, where e ≈ 2.718 is the
base of natural logarithms. This phenomenon, and the fact that the method
is an evolutionary improvement over simple replacement selection, naturally led
them to call their method natural selection.

The “natural” law for run lengths can be proved by considering the snowplow
of Fig. 64 again, and applying elementary calculus. Let L be the length of the
track, and let x(t) be the position of the snowplow at time t, for 0 ≤ t ≤ T.
The reservoir is assumed to be full at time T, when the snow stops temporarily
while the plow returns to its starting position (clearing the P units of snow
remaining in its path). The situation is the same as before except that the
“balance condition” is different; instead of P units of snow on the road at all
times, we have P units of snow in front of the plow, and the reservoir (behind
the plow) gets up to P′ = P units. The snowplow advances by dx during a
time interval dt if h(x,t) dx records are output, where h(x,t) is the height of
the snow at time t and position x = x(t), measured in suitable units; hence
h(x,t) = h(x,0) + Kt for all x, where K is the rate of snowfall. Since the
number of records in memory stays constant, h(x,t) dx is also the number of
records that are input ahead of the plow, namely K dt (L − x) (see Fig. 67).
Thus

    dx/dt = K(L − x)/h(x,t).                                                 (2)

Fortunately, it turns out that h(x,t) is constant, equal to KT, whenever x = x(t)
and 0 ≤ t ≤ T, since the snow falls steadily at position x(t) for T − t units of time
after the plow passes that point, plus t units of time before it comes back. In
other words, the plow sees all snow at the same height on its journey, assuming
that a steady state has been reached where each journey is the same. Hence
the total amount of snow cleared (the run length) is LKT; and the amount of
snow in memory is the amount cleared after time T, namely KT(L − x(T)). The
solution to (2) such that x(0) = 0 is

    x(t) = L(1 − e^(−t/T));                                                  (3)

hence P = LKTe^(−1) = (run length)/e; and this is what we set out to prove.



Exercises 21 through 23 show that this analysis can be extended to the case
of general P′; for example, when P′ = 2P the average run length turns out to
be e^θ(e − θ)P, where θ = (e − √(e² − 4))/2, a result that probably wouldn't have
been guessed offhand! Table 2 shows the dependence of run length on reservoir
size; the usefulness of natural selection in a given computer environment can be
estimated by referring to this table. The table entries for reservoir size < P use
an improved technique that is discussed in exercise 27.

The ideas of delayed run reconstitution and natural selection can be combined,
as discussed by T. C. Ting and Y. W. Wang in Comp. J. 20 (1977),
298-301.


Table 2
RUN LENGTHS BY NATURAL SELECTION

    Reservoir size   Run length    k + θ       Reservoir size   Run length    k + θ
       0.10000P       2.15780P    0.32071        0.00000P        2.00000P    0.00000
       0.50000P       2.54658P    0.69952        0.43428P        2.50000P    0.65348
       1.00000P       2.71828P    1.00000        1.30432P        3.00000P    1.15881
       2.00000P       3.53487P    1.43867        1.95014P        3.50000P    1.42106
       3.00000P       4.16220P    1.74773        2.72294P        4.00000P    1.66862
       4.00000P       4.69446P    2.01212        4.63853P        5.00000P    2.16714
       5.00000P       5.16369P    2.24938       21.72222P       10.00000P    4.66667
      10.00000P       7.00877P    3.17122        5.29143P        5.29143P    2.31329

The quantity k + θ is defined in exercise 22, or (when k = 0) in exercise 27.



*Analysis of replacement selection. Let us now return to the case of replace- 
ment selection without an auxiliary reservoir. The snowplow analogy gives us 
a fairly good indication of the average length of runs obtained by replacement 
selection in the steady-state limit, but it is possible to get much more precise 
information about Algorithm R by applying the facts about runs in permutations 
that we have studied in Section 5.1.3. For this purpose it is convenient to assume 
that the input file is an arbitrarily long sequence of independent random real 
numbers between 0 and 1. 

Let

    g_P(z_1, z_2, ..., z_k) = ∑ a_P(l_1, l_2, ..., l_k) z_1^(l_1) z_2^(l_2) ... z_k^(l_k),

summed over all l_1, l_2, ..., l_k ≥ 0, be the generating function for run lengths
produced by P-way replacement selection on such a file, where a_P(l_1, l_2, ..., l_k)
is the probability that the first run has length l_1, the second has length l_2, ...,
the kth has length l_k. The following “independence theorem” is basic, since it
reduces the analysis to the case P = 1:

Theorem K. g_P(z_1, z_2, ..., z_k) = g_1(z_1, z_2, ..., z_k)^P.

Proof. Let the input keys be X_1, X_2, X_3, .... Algorithm R partitions them into




tree; the subsequence containing X_n is determined by the values of X_1, ..., X_(n−1).
Each of these subsequences is therefore an independent sequence of independent 
random numbers between 0 and 1. Furthermore, the output of replacement 
selection is precisely what would be obtained by doing a P-way merge on these
subsequences; an element belongs to the jth run of a subsequence if and only if 
it belongs to the jth run produced by replacement selection (since LASTKEY and 
KEY(Q) belong to the same subsequence in step R4). 

In other words, we might just as well assume that Algorithm R is being
applied to P independent random input files, and that step R4 reads the next
record from the file corresponding to external node Q; in this sense, the algorithm
is equivalent to a P-way merge, with “stepdowns” marking the ends of the runs.

Thus the output has runs of lengths (l_1, ..., l_k) if and only if the subsequences
have runs of respective lengths (l_11, ..., l_1k), ..., (l_P1, ..., l_Pk), where
the l_ij are some nonnegative integers satisfying ∑_(1≤i≤P) l_ij = l_j for 1 ≤ j ≤ k.
It follows that

    a_P(l_1, ..., l_k) = ∑ a_1(l_11, ..., l_1k) ... a_1(l_P1, ..., l_Pk),

summed over l_11 + ⋯ + l_P1 = l_1, ..., l_1k + ⋯ + l_Pk = l_k; and this is equivalent
to the desired result. |

We have discussed the average length L_k of the kth run, when P = 1,
in Section 5.1.3, where the values are tabulated in Table 5.1.3-2. Theorem K
implies that the average length of the kth run for general P is P times as long
as the average when P = 1, namely L_k P; and the variance is also P times as
large, so the standard deviation of the run length is proportional to √P. These
results were first derived by B. J. Gassner about 1958.

Thus the first run produced by Algorithm R will be about (e−1)P ≈ 1.718P
records long, for random data; the second run will be about (e²−2e)P ≈ 1.952P
records long; the third, about 1.996P; and subsequent runs will be very close
to 2P records long until we get to the last two runs (see exercise 14). The
standard deviation of most of these run lengths is approximately √((4e−10)P) ≈
0.934√P [CACM 6 (1963), 685-688]. Furthermore, exercise 5.1.3-10 shows that
the total length of the first k runs will be fairly close to (2k − ⅓)P, with a
standard deviation of ((⅔k + 1)P)^(1/2). The generating functions g_1(z, z, ..., z)
and g_1(1, ..., 1, z) are derived in exercises 5.1.3-9 and 11.
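These estimates are easy to spot-check empirically; the following Monte Carlo sketch (mine) simulates P-way replacement selection on uniform keys by ordering (run, key) pairs lexicographically, exactly as in the (S,K) idea behind Algorithm R:

    import heapq, random

    def average_first_run(P, trials=10000):
        total = 0
        for _ in range(trials):
            heap = [(0, random.random()) for _ in range(P)]   # run-0 keys
            heapq.heapify(heap)
            length = 0
            while True:
                run, key = heapq.heappop(heap)
                if run == 1:                  # the first run has ended
                    break
                length += 1
                new = random.random()         # replacement record
                heapq.heappush(heap, (1 if new < key else 0, new))
            total += length
        return total / trials                 # should be near (e - 1) * P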

The analysis above has assumed that the input file is infinitely long, but
the proof of Theorem K shows that the same probability a_P(l_1, ..., l_k) would
be obtained in any random input sequence containing at least l_1 + ⋯ + l_k + P
elements. So the results above are applicable for, say, files of size N ≥ (2k+1)P,
in view of the small standard deviation.

We will be seeing some applications in which the merging pattern wants
some of the runs to be ascending and some to be descending. Since the residue
accumulated in memory at the end of an ascending run tends to contain numbers
somewhat smaller on the average than random data, a change in the direction
of ordering decreases the average length of the runs. Consider, for example, a
snowplow that must make a U-turn every time it reaches an end of a straight
road; it will go very speedily over the area just plowed. The run lengths
when directions are reversed vary between 1.5P and 2P for random data (see
exercise 24).

EXERCISES 

1. [10] What is Step 4, in the example of four-way merging at the beginning of this
section? 

2. [12] What changes would be made to the tree of Fig. 63 if the key 061 were
replaced by 612? 

3. [16] (E. F. Moore.) What output is produced by four-way replacement selection 
when it is applied to successive words of the following sentence: 

fourscore and seven years ago our fathers brought forth 
on this continent a new nation conceived in liberty and 
dedicated to the proposition that all men are created equal. 

(Use ordinary alphabetic order, treating each word as one key.) 

4. [16] Apply four-way natural selection to the sentence in exercise 3, using a reservoir
of capacity 4.

5. [00] True or false: Replacement selection using a tree works only when P is a 
power of 2 or the sum of two powers of 2. 

6. [15] Algorithm R specifies that P must be ≥ 2; what comparatively small changes
to the algorithm would make it valid for all P ≥ 1?

7 . [17] What does Algorithm R do when there is no input at all? 

8. [20] Algorithm R makes use of an artificial key “∞” that must be larger than
any possible key. Show that the algorithm might fail if an actual key were equal to ∞,
and explain how to modify the algorithm in case the implementation of a true ∞ is
inconvenient.

► 9. [23] How would you modify Algorithm R so that it causes certain specified runs 
(depending on RC) to be output in ascending order, and others in descending order? 

10. [26] The initial setting of the LOSER pointers in step R1 usually doesn't correspond
to any actual tournament, since external node P + j may not lie in the subtree below
internal node j. Explain why Algorithm R works anyway. [Hint: Would the algorithm
work if {LOSER(LOC(X[0])), ..., LOSER(LOC(X[P−1]))} were set to an arbitrary
permutation of {LOC(X[0]), ..., LOC(X[P−1])} in step R1?]

11. [M20] True or false: The probability that KEY(Q) < LASTKEY in step R4 is
approximately 50%, assuming random input.

12. [M46] Carry out a detailed analysis of the number of times each portion of
Algorithm R is executed; for example, how often does step R6 set LOSER(T) ← Q?

13 . [13] Why is the second run produced by replacement selection usually longer than 
the first run? 

► 14 . [HM25] Use the snowplow analogy to estimate the average length of the last two 
runs produced by replacement selection on a long sequence of input data. 


264 SORTING 


5.4.1 


15. [20] True or false: The final run produced by replacement selection never contains 
more than P records. Discuss your answer. 

16. [M26] Find a “simple” necessary and sufficient condition that a file R_1 R_2 ... R_N
will be completely sorted in one pass by P-way replacement selection. What is the
probability that this happens, as a function of P and N, when the input is a random
permutation of {1, 2, ..., N}?

17. [20] What is output by Algorithm R when the input keys are in decreasing order,
K_1 > K_2 > ⋯ > K_N?

► 18. [22] What happens if Algorithm R is applied again to an output file that was 
produced by Algorithm R? 

19. [HM22] Use the snowplow analogy to prove that the first run produced by replacement
selection is approximately (e − 1)P records long.

20. [HM24] Approximately how long is the first run produced by natural selection,
when P′ = P?

► 21. [HM23] Determine the approximate length of runs produced by natural selection 
when P' < P. 

22. [HM40] The purpose of this exercise is to determine the average run length
obtained in natural selection, when P′ > P. Let κ = k + θ be a real number ≥ 1,
where k = ⌊κ⌋ and θ = κ mod 1, and consider the function F(κ) = F_k(θ), where F_k(θ)
is the polynomial defined by the generating function

    ∑_(k≥0) F_k(θ) z^k = e^(−θz) / (1 − z e^(1−z)).

Thus, F_0(θ) = 1, F_1(θ) = e − θ, F_2(θ) = e² − e − eθ + ½θ², etc.

Suppose that a snowplow starts out at time 0 to simulate the process of natural
selection, and suppose that after T units of time exactly P snowflakes have fallen behind
it. At this point a second snowplow begins on the same journey, occupying the same
position at time t + T as the first snowplow did at time t. Finally, at time κT, exactly
P′ snowflakes have fallen behind the first snowplow; it instantaneously plows the rest
of the road and disappears.

Using this model to represent the process of natural selection, show that a run
length equal to e^θ F(κ)P is obtained when

    P′/P = k + 1 + e^θ ( κF(κ) − ∑_(j=0)^(k) F(κ−j) ).
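As a sanity check, the first few polynomials can be expanded mechanically; this sympy snippet is an added illustration, not part of the exercise:

    import sympy as sp

    z, theta = sp.symbols('z theta')
    gf = sp.exp(-theta * z) / (1 - z * sp.exp(1 - z))
    # Coefficients of z^0, z^1, z^2 give F_0 = 1, F_1 = e - theta,
    # and F_2 = e^2 - e - e*theta + theta**2/2, as stated above.
    print(sp.series(gf, z, 0, 3))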

23. [HM35] The preceding exercise analyzes natural selection when the records from 
the reservoir are always read in the same order as they were written, first-in-first- 
out. Find the approximate run length that would be obtained if the reservoir contents 
from the preceding run were read in completely random order, as if the records in the 
reservoir had been thoroughly shuffled between runs. 

24. [HM39] The purpose of this exercise is to analyze the effect caused by haphazardly 
changing the direction of runs in replacement selection. 

a) Let g_P(z_1, z_2, ..., z_k) be a generating function defined as in Theorem K, but with
each of the k runs specified as to whether it is to be ascending or descending.




For example, we might say that all odd-numbered runs are ascending, all even-numbered
runs are descending. Show that Theorem K is valid for each of the 2^k
generating functions of this type.


b) As a consequence of (a), we may assume that P = 1. We may also assume that the
input is a uniformly distributed sequence of independent random numbers between
0 and 1. Let

    a(x,y) = e^(1−x) − e^(y−x),   if x ≤ y;
    a(x,y) = e^(1−x),             if x > y.

Given that f(x) dx is the probability that a certain ascending run begins with x,
prove that (∫_0^1 a(x,y) f(x) dx) dy is the probability that the following run begins
with y. [Hint: Consider, for each n ≥ 0, the probability that x ≤ X_1 ≤ ⋯ ≤
X_n > y, when x and y are given.]


c) Consider runs that change direction with probability p; in other words, the direc- 
tion of each run after the first is randomly chosen to be the same as that of the 
previous run, q = (1 — p) of the time, but it is to be in the opposite direction p of 
the time. (Thus when p = 0, all runs have the same direction; when p = 1, the 
runs alternate in direction; and when p = |, the runs are independently random.) 
Let 


■/ 


/i(x) = l, /n+i(y) =P a(x,y)fn(l -x)dx + q / a(x,y)f n (x)dx. 


f' 

Jo 


Show that the probability that the nth run begins with x is f n (x)dx when the 
(n — l)st run is ascending, / n ( 1 — x) dx when the (n — l)st run is descending. 

d) Find a solution / to the steady-state equations 


f(y)=p[ a(x,y)f(l-x)dx + q[ a(x,y)f(x) 
Jo Jo 


dx, 



[Hint: Show that f"(x) is independent of x.] 

e) Show that the sequence f n (x ) in part (c) converges rather rapidly to the function 
f(x) in part (d). 

f) Show that the average length of an ascending run starting with x is e 1 ~ x . 

g) Finally, put all these results together to prove the following theorem: If the 
directions of consecutive runs are independently reversed with probability p in 
replacement selection, the average run length approaches (6/(3 + p))P. 

(The case p = 1 of this theorem was first derived by Knuth [CACM 6 (1963), 
685-688]; the case p = | was first proved by A. G. Konheim in 1970.) 

25. [HM40] Consider the following procedure: 

Nl. Read a record into a one-word “reservoir.” Then read another record, R, and 
let K be its key. 

N2. Output the reservoir, set LASTKEY to its key, and set the reservoir empty. 
N3. If K < LASTKEY then output R and set LASTKEY <— K and go to N5. 

N4. If the reservoir is nonempty, return to N2; otherwise enter R into the reservoir. 
N5. Read in a new record, R, and let K be its key. Go to N3. | 

This is essentially equivalent to natural selection with P = 1 and with P' = 1 or 2 
(depending on whether you choose to empty the reservoir at the moment it fills or at 


266 SORTING 


5.4.1 


the moment it is about to overfill), except that it produces descending runs, and it 
never stops. The latter anomalies are convenient and harmless assumptions for the 
purposes of this problem. 

Proceeding as in exercise 24, let f n (x,y)dydx be the probability that x and y are 
the respective values of LASTKEY and K just after the nth time step N2 is performed. 
Prove that there is a function g n (x) of one variable such that f n (x,y) = g n (x) when 
x < y, and f n (x,y) = g n (x) — e y (g n (x) — g n {y)) when x > y. This function g n ( x) is 
defined by the relations gi(x) = 1, 


9n+i{x) J e g n (u) du + J dv (v 1) J , du ((e v — l)g n (u) + g n (v)) 

+ x f dv f du((e v - l)g n (u) + g n (v)). 

J X j V 

Show further that the expected length of the nth run is 

[ dx f dy (gn(x)(e y — 1) -f g n (y))(2 — |r/ 2 ) + f dx (1 - x)g n {x)e x . 

Jo Jo J 0 

[Note. The steady-state solution to these equations appears to be very complicated; 
it has been obtained numerically by J. McKenna, who showed that the run lengths 
approach a limiting value « 2.61307209. Theorem K does not apply to natural selection, 
so the case P = 1 does not carry over to other P.] 

26. [M33] Considering the algorithm in exercise 25 as a definition of natural selection 
when P' = 1, find the expected length of the first run when P' = r, for any r > 0 as 
follows. 

a) Show that the first run has length n with probability 

f n + r ' 
n 


(n + r) 


j(n + r + 1)!. 


b) Define “associated Stirling numbers” j[^]] by the rules 
Oil IT n 

— Om0 1 

LraJJ LLra 


= 5 ; 
Prove that 


(n + m — 1) 


n — 1 
m 


+ 


n — 1 
m — 1 


n + r 
n 


for n > 0. 


E C n + r\ ITrl" 

\k + r) [ _k\_ 

fc = U 

c) Prove that the average length of the first run is therefore c r e - r - 1, where 


c r 


r 

E 



r + k + 1 
(r + k)\ 


► 27. [HM30] (W. Dobosiewicz.) When natural selection is used with P' < P, we need 
not stop forming a run when the reservoir becomes full; we can store records that do 
not belong to the current run in the main priority queue, as in replacement selection, 
until only P' records of the current run are left. Then we can flush them to the output 
and replace them with the reservoir contents. 

How much better is this method than the simpler approach analyzed in exercise 21? 
28. [25] The text considers only the case that all records to be sorted have a fixed size. 
How can replacement selection be done reasonably well on variable-length records? 


5.4.2 


THE POLYPHASE MERGE 267 


29. [22] Consider the 2 k nodes of a complete binary tree that has been right-threaded, 
illustrated here when k = 3: 



(Compare with 2.3.1-(io); the top node is the list head, and the dotted lines are thread 
links. In this exercise we are not concerned with sorting but rather with the structure 
of complete binary trees when a list-head-like node 0 has been added above node 1, as 
in the “tree of losers,” Fig. 63.) 

Show how to assign the 2 n+k internal nodes of a large tree of losers onto these 
2 k host nodes so that (i) every host node holds exactly 2” nodes of the large tree; 
(ii) adjacent nodes in the large tree either are assigned to the same host node or to 
host nodes that are adjacent (linked); and (iii) no two pairs of adjacent nodes in the 
large tree are separated by the same link in the host tree. [Multiple virtual processors 
in a large binary tree network can thereby be mapped to actual processors without 
undue congestion in the communication links.] 

30. [M29] Prove that if n > k > 1, the construction in the preceding exercise is 
optimum, in the sense that any 2 fc -node host graph satisfying (i), (ii), and (iii) must 
have at least 2 k + 2 k -1 — 1 edges (links) between nodes. 

*5.4.2. The Polyphase Merge 

Now that we have seen how initial runs can be built up, we shall consider various 
patterns that can be used to distribute them onto tapes and to merge them 
together until only a single run remains. 

Let us begin by assuming that there are three tape units, Tl, T2, and T3, 
available; the technique of “balanced merging,” described near the beginning of 
Section 5.4, can be used with P — 2 and T = 3, when it takes the following form; 

Bl. Distribute initial runs alternately on tapes Tl and T2. 

B2. Merge runs from Tl and T2 onto T3; then stop if T3 contains only one run. 
B3. Copy the runs of T3 alternately onto Tl and T2, then return to B2. | 

If the initial distribution pass produces 5 runs, the first merge pass will produce 
[5/2] runs on T3, the second will produce [5/4], etc. Thus if, say, 17 < 5 < 32, 
we will have 1 distribution pass, 5 merge passes, and 4 copy passes; in general, 
if 5 > 1, the number of passes over all the data is 2[lg5]. 

The copying passes in this procedure are undesirable, since they do not 
reduce the number of runs. Half of the copying can be avoided if we use a 
two-phase procedure: 

Al. Distribute initial runs alternately on tapes Tl and T2. 

A2. Merge runs from Tl and T2 onto T3; then stop if T3 contains only one run. 
A3. Copy half of the runs from T3 onto Tl. 

A4. Merge r ims from Tl and T3 onto T2; then stop if T2 contains only one run. 
A5. Copy half of the runs from T2 onto Tl. Return to A2. | 


268 SORTING 


5.4.2 


The number of passes over the data has been reduced to § fig 5] + | , since steps 
A3 and A5 do only “half a pass”; about 25 percent of the time has therefore 
been saved. 

The copying can actually be eliminated entirely, if we start with F n runs 
on T1 and F n _ 1 runs on T2, where F n and F n _j are consecutive Fibonacci 
numbers. Consider, for example, the case n — 7, S = F n + F n _! = 13 + 8 = 21: 


Phase 

1 

2 

3 

4 

5 

6 
7 


Contents of T1 Contents of T2 


1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 . 1 , 1,14 

14444 

5,5,5 

5 

21 


1 , 1 , 1,14444 

3, 3, 3, 3, 3 
3,3 

13 


Contents of T3 Remarks 


2 , 2 , 2 , 2 , 2 , 2 , 2, 2 

2,2,2 

8,8 

8 


Initial distribution 
Merge 8 runs to T3 
Merge 5 runs to T2 
Merge 3 runs to T1 
Merge 2 runs to T3 
Merge 1 run to T2 
Merge 1 run to T1 


Here, for example, “2, 2, 2, 2, 2, 2, 2, 2” denotes eight runs of relative length 2, con- 
sidering each initial run to be of relative length 1. Fibonacci numbers are 
omnipresent in this chart! 

Only phases 1 and 7 are complete passes over the data; phase 2 processes 
only 16/21 of the initial runs, phase 3 only 15/21, etc., and so the total number 
of passes comes to (21 + 16 + 15 + 15 + 16 + 13 + 21)/21 = 5| if we assume 
that the initial runs have approximately equal length. By comparison, the two- 
phase procedure above would have required 8 passes to sort these 21 initial runs. 
We shall see that in general this “Fibonacci” pattern requires approximately 
1.04 lg S' + 0.99 passes, making it competitive with a four - tape balanced merge 
although it requires only three tapes. 

The same idea can be generalized to T tapes, for any T > 3, using (T - 1)- 
way merging. We shall see, for example, that the four-tape case requires only 
about .703 lgS + 0.96 passes over the data. The generalized pattern involves 


generalized Fibonacci 

numbers. 

Consider the following six-tape example: 


Phase 

T1 

T2 

T3 

T4 

T5 

T6 

Initial runs processed 


1 

l 31 

^30 

^28 

l 24 

l 16 

— 

31 + 30 + 28 + 24 + 16 = 

129 

2 

l 15 

l 14 

l 12 

l 8 

— 

5 16 

16 x 5 = 

80 

3 

l 7 

l 6 

l 4 

— 

9 s 

5 8 

8x9 = 

72 

4 

l 3 

l 2 

— 

17 4 

9 4 

5 4 

4 x 17 = 

68 

5 

l 1 

— - 

33 2 

17 2 

9 2 

5 2 

2 x 33 = 

66 

6 

— 

65 1 

33 1 

17 1 

9 1 

5 1 

1 x 65 = 

65 

7 

129 1 

— 

— 

— 

— 

— 

1 x 129 = 

129 


Here l 31 stands for 31 runs of relative length 1, etc.; five-way merges have 
been used throughout. This general pattern was developed by R. L. Gilstad 
[Proc. Eastern Joint Computer Conf. 18 (1960), 143-148], who called it the 
polyphase merge. The three-tape case had been discovered earlier by B. K. Betz 
[unpublished memorandum, Minneapolis-Honeywell Regulator Co. (1956)]. 

In order to make polyphase merging work as in the examples above, we 
need to have a “perfect Fibonacci distribution” of runs on the tapes after each 


5.4.2 


THE POLYPHASE MERGE 269 


phase. By reading the table above from bottom to top, we can see that the first 
seven perfect Fibonacci distributions when T = 6 are {1,0, 0,0,0}, {1,1, 1,1,1}, 
{2, 2, 2, 2,1}, {4, 4, 4, 3, 2}, {8, 8, 7, 6, 4}, {16,15,14,12,8}, and {31,30,28,24,16}. 
The big questions now facing us are 

1. What is the rule underlying these perfect Fibonacci distributions? 

2. What do we do if S does not correspond to a perfect Fibonacci distribution? 

3. How should we design the initial distribution pass so that it produces the 
desired configuration on the tapes? 

4. How many “passes” over the data will a T-tape polyphase merge require, as 
a function of S (the number of initial runs)? 

We shall discuss these four questions in turn, first giving “easy answers” and 
then making a more intensive analysis. 

The perfect Fibonacci distributions can be obtained by running the pattern 
backwards, cyclically rotating the tape contents. For example, when T = 6 we 
have the following distribution of runs: 


Level 

T1 

T2 

T3 

T4 

T5 

Total 

Final output 
will be on 

0 

1 

0 

0 

0 

0 

1 

T1 

1 

1 

1 

1 

1 

1 

5 

T6 

2 

2 

2 

2 

2 

1 

9 

T5 

3 

4 

4 

4 

3 

2 

17 

T4 

4 

8 

8 

7 

6 

4 

33 

T3 

5 

16 

15 

14 

12 

8 

65 

T2 

6 

31 

30 

28 

24 

16 

129 

T1 

7 

61 

59 

55 

47 

31 

253 

T6 

8 

120 

116 

108 

92 

61 

497 

T5 

n 

(In 

b n 

C n 

dn 


t n 

T (k) 

n + 1 

d n 4“ bn 

(In Cn 

&n “1“ dn 

&n 4" (-n 

dn 

tn 4“ 4&7T, 

T (fc - 1) 


(Tape T6 will always be empty after the initial distribution.) 

The rule for going from level n to level n + 1 shows that the condition 

0>n ^ '' > dn ^ ^ n (^) 

will hold in every level. In fact, it is easy to see from (l) that 

— ^n — 1 5 

dn — ^n — 1 4“ ^n~ 1 ^n — 1 "h ^n— 2? 

Cn — ®n— 1 “1“ ^n — 1 — ^n—l H“ ^n — 2 ^n— 3? ( 3 ) 

— ^n— 1 “I” ^n — 1 — ^n— 1 ®n— 2 “I - ^n — 3 ”1“ ^n— 4? 

dn ~ &n — 1 4" ^n— 1 ~ d n — 1 “t" ^n — 2 d - d n — 3 “1“ ^n — 4 4“ d n — 5? 

where do = 1 and where we let a n = 0 for n = — 1, —2, —3, —4. 


270 SORTING 


5.4.2 


The pth- order Fibonacci numbers are defined by the rules 

F^=Fi P \ + F^ 2 + --- + F^ p , for n > p\ 

F r { p} — 0 , for 0 < n < p - 2 ; F^\ = 1 . U ’ 

In other words, we start with p — 1 Os, then 1, and then each number is the sum 
of the preceding p values. When p = 2, this is the usual Fibonacci sequence, 
and when p = 3 it has been called the Tribonacci sequence. Such sequences were 
apparently first studied for p > 2 by Narayana Pandita in 1356 [see P. Singh. 
Historia Mathematica 12 (1985), 229-244], then many years later by V. Schlegel 
in El Progreso Matematico 4 (1894), 173-174. Schlegel derived the generating 
function 

z v ~ x Z p ~ x - z p 

1-z-z 2 zP ~ 1 - 2z + ZP + 1 ' ^ 


n>0 


The last equation of ( 3 ) shows that the number of runs on T1 during a six-tape 
polyphase merge is a fifth-order Fibonacci number: a n = f£ 4 . 

In general, if we set P = T- 1, the polyphase merge distributions for T tapes 
will correspond to Pth order Fibonacci numbers in the same way. The fcth tape 
gets 


: p(P) 1 1 p(-P) 

r n+P~2 + P n+P - 3 r t„ +K2 

initial runs in the perfect nth level distribution, for 1 < k < P, and the total 
number of initial runs on all tapes is therefore 


tn - PF n+p_ 2 + ( P tyF^+p- 3 H F - (6) 

This settles the issue of “perfect Fibonacci distributions.” But what should 
we do if S is not exactly equal to t. n , for any n? And how do we get the runs 
onto the tapes in the first place? 

When S isn’t perfect (and so few values are), we can do just as we did in 
balanced P- way merging, adding artificial “dummy runs” so that we can pretend 
S is perfect after all. There are several ways to add the dummy runs, and we 
aren t. ready yet to analyze the “best” way of doing this. We shall discuss first 
a method of distribution and dummy-run assignment that isn’t strictly optimal, 
although it has the virtue of simplicity and appears to be better than all other 
equally simple methods. 


Algorithm D ( Polyphase merge sorting with “ horizontal ” distribution). This 
algorithm takes initial runs and disperses them to tapes, one run at a time, until 
the supply of initial runs is exhausted. Then it specifies how the tapes are to 
be merged, assuming that there are T = P + 1 > 3 available tape units, using 
P ~ way merging. Tape T may be used to hold the input, since it does not receive 
any initial runs. The following tables are maintained: 

A [ jl , 1 < j < T: The perfect Fibonacci distribution we are striving for. 

D Ij3 , 1 < 3 S T: Number of dummy runs assumed to be present at the 
beginning of logical tape unit number j. 


5.4.2 


THE POLYPHASE MERGE 271 



Fig. 68. Polyphase merge sorting. 

TAPE [j] , 1 < j < T: Number of the physical tape unit corresponding to logical 
tape unit number j. 

(It is convenient to deal with “logical tape unit numbers” whose assignment to 

physical tape units varies as the algorithm proceeds.) 

Dl. [Initialize.] Set A [j] 4- D[j] 4— 1 and TAPE [j] 4- j, for 1 < j < T. Set 

A [T] 4— D [T] 4— 0 and TAPE[T] 4— T. Then set l 4— 1, j 4— 1. 

D2. [Input to tape j.} Write one run on tape number j, and decrease D[j] by 1. 
Then if the input is exhausted, rewind all the tapes and go to step D5. 

D3. [Advance j] If D [j] < D [j + 1] , increase j by 1 and return to D2. 
Otherwise if D [j] = 0, go on to D4. Otherwise set j 4— 1 and return to D2. 

D4. [Up a level.] Set l 4— l + 1, a 4— A [1] , and then for j = 1, 2, . . . , P (in 
this order) set D[j] 4- a + A[j + 1] - A [j] and A [j] 4- a + A [ j + 1] . 
(See (i) and note that A[P + 1] is always zero. At this point we will have 

D [1] > D [2] > • • • > D [T] .) Now set j 4— 1 and return to D2. 

D5. [Merge.] If l = 0, sorting is complete and the output is on TAPE[1] . Other- 
wise, merge runs from TAPE [1] , . . . , TAPE [P] onto TAPE [T] until TAPE [P] 
is empty and D [P] = 0. The merging process should operate as follows, 
for each run merged: If D[j] > 0 for all j , 1 < j < P, then increase D[T] 
by 1 and decrease each D [j] by 1 for 1 < j < P; otherwise merge one run 
from each TAPE [j] such that D [ j] = 0, and decrease D [j] by 1 for each 
other j. (Thus the dummy runs are imagined to be at the beginning of the 
tape instead of at the ending.) 

D6. [Down a level.] Set l 4-Z-l. Rewind TAPE [P] and TAPE [T] . (Actually the 
rewinding of TAPE[P] could have been initiated during step D5, just after 
its last block was input.) Then set (TAPE [1] , TAPE [2] ,..., TAPE [T] ) 4— 
(TAPE [T], TAPE [1],..., TAPE [T - 1]), (D [1] , D [2] , . . . , D [T] ) 4— (D[T], 
D [1] , . . . , D [T — 1] ), and return to step D5. | 








272 SORTING 


Fig. 69. The order in which runs 34 through 65 are 
distributed to tapes, when advancing from level 4 to 
level 5. (See the table of perfect distributions, Eq. (i).) 
Shaded areas represent the first 33 runs that were dis- 
tributed when level 4 was reached. The bottom row 
corresponds to the beginning of each tape. 


34 





35 


36 


37 

38 


39 


40 

42 


43 


44 

46 


47 


48 

51 


52 


53 

56 


57 


58 

61 


62 


63 


Ti T 2 T 3 


5.4.2 



T 4 T s 


The distribution rule that is stated so succinctly in step D3 of this algorithm 
is intended to equalize the number of dummies on each tape as well as possible. 
Figure 69 illustrates the order of distribution when we go from level 4 (33 runs) 
to level 5 (65 runs) in a six-tape sort; if there were only, say, 53 initial runs, 
all runs numbered 54 and higher would be treated as dummies. (The runs are 
actually being written at the end of the tape, but it is best to imagine them being 
written at the beginning, since the dummies are assumed to be at the beginning.) 

We have now discussed the first three questions listed above, and it remains 
to consider the number of “passes” over the data. Comparing our six-tape 
example to the table (i), we see that the total number of initial runs processed 
when S — t e was a 5 ti + + a 2 t 4 + a x t 5 + agte, excluding the initial 

distribution pass. Exercise 4 derives the generating functions 


tt(z) ^ ^ dnZ 
n> 0 

H z ) = J2 tnzn 

n> 1 


i 

1 — z — z 2 — z 3 — z A — z 5 ' 

5 z + 4z 2 + 3z 3 + 2 z 4 + z 5 
1 — z — z 2 — z z — z 4 — z 5 ' 


(7) 


It follows that, in general, the number of initial runs processed when S = t n 
is exactly the coefficient of z n in a(z)t(z), plus t n (for the initial distribution 
pass). This makes it possible to calculate the asymptotic behavior of polyphase 
merging, as shown in exercises 5 through 7, and we obtain the following results: 


Table 1 

APPROXIMATE BEHAVIOR OF POLYPHASE MERGE SORTING 


Tapes 

Phases 

Passes 

Pass/phase 

Growth ratio 

3 

2.078 In 5 + 0.672 

1.504 In S + 0.992 

72% 

1.6180340 

4 

1.641 In 5 + 0.364 

1.015 In S + 0.965 

62% 

1.8392868 

5 

1.524 In S + 0.078 

0.863 In 5 + 0.921 

57% 

1 .9275620 

6 

1.479 In S — 0.185 

0.795 In 5 + 0.864 

54% 

1.9659482 

7 

1.460 In 5 - 0.424 

0.762 In S + 0.797 

52% 

1.9835828 

8 

1.451 In S — 0.642 

0.744 In 5 + 0.723 

51% 

1.9919642 

10 

1.445 In S - 1.017 

0.728 In 5 + 0.568 

50% 

1.9980295 

20 

1.443 In S - 2.170 

0.721 In 5 -0.030 

50% 

1 .9999981 


5.4.2 


THE POLYPHASE MERGE 273 


In Table 1, the “growth ratio” is lim,,-^ t n+ i/t n , the approximate factor by 
which the number of runs increases at each level. “Passes” denotes the average 
number of times each record is processed, namely 1/S times the total number 
of initial runs processed during the distribution and merge phases. The stated 
number of passes and phases is correct in each case up to 0(S~ C ), for some e > 0, 
for perfect distributions as S — > oo. 

Figure 70 shows the average number of times each record is merged, as 
a function of S, when Algorithm D is used to handle the case of nonperfect 
numbers. Note that with three tapes there are “peaks” of relative inefficiency 
occurring just after the perfect distributions, but this phenomenon largely dis- 
appears when there are four or more tapes. The use of eight or more tapes gives 
comparatively little improvement over six or seven tapes. 



1 2 5 10 20 50 100 200 500 1000 2000 5000 


Initial runs, S 


Fig. 70. Efficiency of polyphase merge using Algorithm D. 


A closer look. In a balanced merge requiring k passes, every record is processed 
exactly k times during the course of the sort. But the polyphase procedure does 
not have this lack of bias; some records may get processed many more times 
than others, and we can gain speed if we arrange to put dummy runs into the 
oft-processed positions. 



274 SORTING 


5.4.2 


Let us therefore study the polyphase distribution more closely; instead of 
merely looking at the number of runs on each tape, as in ( 1 ), let us associate 
with each run its merge number , the number of times it will be processed during 
the complete polyphase sort. We get the following table in place of ( 1 ): 


Level 

Tl 

T2 

T3 

T4 

T5 

0 

0 

— 





1 

1 

1 

1 

1 

1 

2 

21 

21 

21 

21 

2 

3 

3221 

3221 

3221 

322 

32 

4 

43323221 

43323221 

4332322 

433232 

4332 

5 

5443433243323221 

544343324332322 

54434332433232 

544343324332 

54434332 

n 

A n 

B n 

C n 

D n 

E n 

n + 1 

(A n + 1 )B n 

(A n -f 1 )C n 

(A n + 1 )D n 

(A n + 1 )E n 

A n + 1 


Here A n is a string of a n values representing the merge numbers for each run 
on Tl, if we begin with the level n distribution; B n is the corresponding string 
for T2; etc. The notation “( A n + 1 )B n v means “A n with all values increased 
by 1 , followed by f?„.” 

Figure 71(a) shows A 5 , B 5 , C 5 , D 5 , E 5 tipped on end, showing how the 
merge numbers for each run appear on tape; notice, for example, that the run at 
the beginning of each tape will be processed five times, while the run at the end 
of Tl will be processed only once. This discriminatory practice of the polyphase 
merge makes it much better to put a dummy run at the beginning of the tape 
than at the end. Figure 71(b) shows an optimum order in which to distribute runs 
for a five-level polyphase merge, placing each new run into a position with the 
smallest available merge number. Algorithm D is not quite as good (see Fig. 69), 
since it fills some “4” positions before all of the “3” positions are used up. 


(a) 


Beginning of tape 


Fig. 71. Analysis of the fifth-level polyp 
numbers, (b) optimum distribution order. 


2 


3 




4 


5 


6 



16 


17 


18 



7 


8 


9 


10 


19 


20 


21 


22 


23 


24 


25 


26 


42 


43 


44 


45 


11 


12 


13 


14 


15 

27 


28 


29 


30 


31 

32 


33 


34 


35 


36 

46 


47 


48 


49 


50 

37 


38 


39 


40 


41 

51 


52 


53 


54 


55 

56 


57 


58 


59 


60 

61 


62 


63 


64 


65 

(b) 





ibution for six 

tapes 

: 0 


merge 


5.4.2 


THE POLYPHASE MERGE 275 


The recurrence relations (8) show that each of B n , C n , D„, and E n are 
initial substrings of A n . In fact, we can use (8) to derive the formulas 

E n = (A n ~ i) + 1, 

E n (A n —iA n — 2 ) T 1, 

C n = (A n -iA n - 2 A n - 3 ) + 1, (g) 

B n — (A n _ 1 A n - 2 A n -3A n _4) + 1 , 

A n = (A n -iA n _ 2 A n - 3 A n - 4 A n ~ 5 ) + 1, 

generalizing Eqs. (3), which treated only the lengths of these strings. Further- 
more, the rule defining the A’s implies that essentially the same structure is 
present at the beginning of every level; we have 


An — 71 Qni (lo) 

where Q n is a string of a n values defined by the law 

Qn = Qn — l (Qn — 2 + l)(<?n-3 + 2 )(Qn -4 + 3)(Qn-5 + 4), for 71 > 1; 

Q o = 0; Q n — e (the empty string) for n < 0. (11) 

Since Q n begins with Q n - 1, we can consider the infinite string Qoo, whose first 
a n elements are equal to Q n \ this string Q x essentially characterizes all the 
merge numbers in polyphase distribution. In the six-tape case, 

Qoo = 011212231223233412232334233434412232334233434452334344534454512232- • • . 

( 12 ) 

Exercise 11 contains an interesting interpretation of this string. 

Given that A n is the string m\m 2 . . . m ari , let 

A n (r)=r rai +i™ 2 + -" + i m “" 

be the corresponding generating function that counts the number of times each 
merge number appears; and define B n (x), C n (x ) 7 D n (x), E n (x) similarly. For 
example, A 4 (x) - x 4 + x 3 + x 3 + x 2 + x 3 + x 2 + x 2 + x = x 4 + 3x 3 + 3x 2 + x. 
Relations (9) tell us that 

E n (x) = x(A n -x(x)) , 

B n (x) x(A n — i(x) -f- A n — 2(3?)), 

C n (x) — x(A n -i(x) + A n _ 2 {x) + A n ^ 3 (x)), (13) 

B n (x) = x(A n -\(x) + A n _ 2 (x) + A n -3(x) + A„_ 4 ( x)), 

A n (x) = x(A n - i(x) + A n - 2 (x) + A n _ 3 (x) + A„_ 4 ( x) + A n _ 5 (x)), 

for n > 1, where A 0 (x) = 1 and A n (x) = 0 for n = -1, -2, -3, -4. Hence 

Y,A n (x)z n = 

n> 0 


1 — x(z + z 2 + z 3 + z 4 + z 5 ) 


Y,x k (z + z 2 +z 3 + z 4 + z 5 ) k . 

k> 0 (14) 


276 SORTING 


5.4.2 


Considering the runs on all tapes, we let 

T n (x) = A n (x) + B n (x) + C n (x) + D n (x) + E n (x), n> 1; (15) 

from (13) we immediately have 


T n {x) 

hence 


5A n -i(x) + 4A n _2(x) + 3j4 n _3(a;) + 2A n - 4 {x) + A n _ 5 (x), 

'ST T ( \ n — X ^ Z + 4z2 + 3z3 + 2-g4 + z 5 ) 

“ " 1 - x(z + z 2 + z 3 + z 4 + z 5 ) ' 

n. > 1 v 7 


(16) 


The form of (16) shows that it is easy to compute the coefficients of T n (x): 



z 

2 2 

z 3 

z 4 

z 5 

z 6 

z 7 

z 8 

z 9 

z 10 

1"H 

1— 1 

z 12 

z 13 

z 14 

X 

5 

4 

3 

2 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

x 2 

0 

5 

9 

12 

14 

15 

10 

6 

3 

1 

0 

0 

0 

0 

X 3 

0 

0 

5 

14 

26 

40 

55 

60 

57 

48 

35 

20 

10 

4 

X 4 

0 

0 

0 

5 

19 

45 

85 

140 

195 

238 

260 

255 

220 

170 

x 5 

0 

0 

0 

0 

5 

24 

69 

154 

294 

484 

703 

918 

1088 

1168 


The columns of this tableau give T n (x); for example, T 4 (x) = 2 x + 1 2 x 2 + 
14 x 3 + 5 x 4 . After the first row, each entry in the tableau is the sum of the five 
entries just above and to the left in the previous row. 

The number of runs in a “perfect” nth level distribution is T n (l), and the 
total amount of processing as these runs are merged is the derivative, 7 ^( 1 ). 
Now 


n>l 


5z + 4 z 2 + 3z 3 + 2z 4 + z 5 
(l - x(z + z 2 + z 3 + z 4 + z 5 )) 2 ’ 


(18) 


setting x — 1 in (16) and (18) gives a result in agreement with our earlier 
demonstration that the merge processing for a perfect nth level distribution is 
the coefficient of z n in a(z)t(z); see (7). 

We can use the functions T rl (x) to determine the work involved when dummy 
runs are added in an optimum way. Let E n (m) be the sum of the smallest m 
merge numbers in an nth level distribution. These values are readily calculated 
by looking at the columns of (17), and we find that E„(m) is given by 

m = l 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 


n=l 1234 5oooooooooooooooooooooooooooooooo 
n = 2 1234 6 8 10 12 14 oooooooooooooooooooooooo 

n = 3 1 2 3 5 7 9 11 13 15 17 19 21 24 27 30 33 36 00 00 00 00 

n = 4 1 2 4 6 8 10 12 14 16 18 20 22 24 26 29 32 35 38 41 44 47 ( 19 ) 

n = 5 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 32 35 38 41 44 47 

n = 6 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 33 36 39 42 45 48 

n = 1 2 4 6 8 10 12 14 16 18 20 23 26 29 32 35 38 41 44 47 50 53 

For example, if we wish to sort 17 runs using a level -3 distribution, the total 
amount of processing is E 3 ( 17 ) = 36 ; but if we use a level -4 or level -5 distribution 


5.4.2 


THE POLYPHASE MERGE 277 


Table 2 

NUMBER OF RUNS FOR WHICH A GIVEN LEVEL IS OPTIMUM 


Level 

T = 3 

T = 4 

T = 5 

T = 6 

T = 7 

T = 8 

T = 9 

T = 10 


1 

2 

2 

2 

2 

2 

2 

2 

2 

Mi 

2 

3 

4 

5 

6 

7 

8 

9 

10 

m 2 

3 

4 

6 

8 

10 

12 

14 

16 

18 

m 3 

4 

6 

10 

14 

14 

17 

20 

23 

26 

m 4 

5 

9 

18 

23 

29 

20 

24 

28 

32 

m 5 

6 

14 

32 

35 

43 

53 

27 

32 

37 

m 6 

7 

22 

55 

76 

61 

73 

88 

35 

41 

m 7 

8 

35 

96 

109 

154 

98 

115 

136 

44 

Ms 

9 

56 

173 

244 

216 

283 

148 

171 

199 

m 9 

10 

90 

280 

359 

269 

386 

168 

213 

243 

Mio 

11 

145 

535 

456 

779 

481 

640 

240 

295 

M n 

12 

234 

820 

1197 

1034 

555 

792 

1002 

330 

M12 

13 

378 

1635 

1563 

1249 

1996 

922 

1228 

1499 

M13 

14 

611 

2401 

4034 

3910 

2486 

1017 

1432 

1818 

Mn 

15 

988 

4959 

5379 

4970 

2901 

4397 

1598 

2116 

m 15 

16 

1598 

7029 

6456 

5841 

10578 

5251 

1713 

2374 

Mi6 

17 

2574 

14953 

18561 

19409 

13097 

5979 

8683 

2576 

M17 

18 

3955 

20583 

22876 

23918 

15336 

6499 

10069 

2709 

Mia 

19 

6528 

44899 

64189 

27557 

17029 

30164 

11259 

15787 

Mig 


and position the dummy runs optimally, the total amount of processing during 
the merge phases is only E 4 (17) = E 5 (17) = 35. It is better to use level 4, even 
though 17 corresponds to a “perfect” level-3 distribution! Indeed, as S gets large 
it turns out that the optimum number of levels is many more than that used in 
Algorithm D. 

Exercise 14 proves that there is a nondecreasing sequence of numbers M n 
such that level n is optimum for M n < S < M n+1 , but not for S > M n+1 . In 
the six-tape case the table of E n (m) we have just calculated shows that 

M 0 = 0, Mi = 2, M 2 = 6, M 3 = 10, M 4 = 14. 

The discussion above treats only the case of six tapes, but it is clear that the 
same ideas apply to polyphase merging with T tapes for any T > 3; we simply 
replace 5 by P = T - 1 in all appropriate places. Table 2 shows the sequences 
M„ obtained for various values of T. Table 3 and Fig. 72 indicate the total 
number of initial runs that are processed after making an optimum distribution 
of dummy runs. (The formulas that appear at the bottom of Table 3 should 
be taken with a grain of salt, since they are least-squares fits over the range 
1 $ S < 5000, or 1 < 5 < 10000 for T — 3; this leads to somewhat erratic 
behavior because the given range of S values is not equally favorable for all T. 
As S > oc, the number of initial runs processed after an optimum polyphase 
distribution is asymptotically Slog P 5, but convergence to this asymptotic limit 
is extremely slow.) 


278 SORTING 


5.4.2 



1^1 I I I L-1..L..I I I I 1 I I I I I 1 I I I I I I I I I I I ill 

1 2 5 10 20 50 100 200 500 1000 2000 5000 

Initial runs, S 


Fig. 72. Efficiency of polyphase merge with optimum initial distribution, using the 
same assumptions as Fig. 70. 


Table 3 

INITIAL RUNS PROCESSED DURING AN OPTIMUM POLYPHASE MERGE 


s 

T = 3 

T = 4 

T = 5 

T = 6 

II 

T = 8 

T = 9 

T = 10 

10 

36 

24 

19 

17 

15 

14 

13 

12 

20 

90 

60 

49 

44 

38 

36 

34 

33 

50 

294 

194 

158 

135 

128 

121 

113 

104 

100 

702 

454 

362 

325 

285 

271 

263 

254 

500 

4641 

3041 

2430 

2163 

1904 

1816 

1734 

1632 

1000 

10371 

6680 

5430 

4672 

4347 

3872 

3739 

3632 

5000 

63578 

41286 

32905 

28620 

26426 

23880 

23114 

22073 

5 

f (1-51 

0.951 

0.761 

0.656 

0.589 

0.548 

0.539 

0.488) x S In S' + 


l (-H 

+ .14 

+.16 

+ .19 

+ .21 

+ .20 

+ .02 

+ .18) X S 


Table 4 shows how the distribution method of Algorithm D compares with 
the results of optimum distribution in Table 3. It is clear that Algorithm D is 
not very close to the optimum when S and T become large; but it is not clear 



5.4.2 


THE POLYPHASE MERGE 279 


Table 4 

INITIAL RUNS PROCESSED DURING THE STANDARD POLYPHASE MERGE 


s 

T = 3 

T = 4 

T = 5 

T = 6 

T = 7 

T = 8 

T = 9 

T = 10 

10 

36 

24 

19 

17 

15 

14 

13 

12 

20 

90 

62 

49 

44 

41 

37 

34 

33 

50 

294 

194 

167 

143 

134 

131 

120 

114 

100 

714 

459 

393 

339 

319 

312 

292 

277 

500 

4708 

3114 

2599 

2416 

2191 

2100 

2047 

2025 

1000 

10730 

6920 

5774 

5370 

4913 

4716 

4597 

4552 

5000 

64740 

43210 

36497 

32781 

31442 

29533 

28817 

28080 


how to do much better than Algorithm D without considerable complication in 
such cases, especially if we do not know S in advance. Fortunately, we rarely 
have to worry about large S (see Section 5.4.6), so Algorithm D is not too bad 
in practice; in fact, it’s pretty good. 

Polyphase sorting was first analyzed mathematically by W. C. Carter [Proc. 
IFIP Congress (1962), 62-66]. Many of the results stated above about optimal 
dummy run placement are due originally to B. Sackman and T. Singer [“A vector 
model for merge sort analysis,” an unpublished paper presented at the ACM Sort 
Symposium (November 1962), 21 pages], Sackman later suggested the horizontal 
method of distribution used in Algorithm D. Donald Shell [CACM 14 (1971), 
713 -719; 15 (1972), 28] developed the theory independently, noted relation ( 10 ), 
and made a detailed study of several different distribution algorithms. Further 
instructive developments and refinements have been made by Derek A. Zave 
[SICOMP 6 (1977), 1-39]; some of Zave’s results are discussed in exercises 15 
through 17. The generating function ( 16 ) was first investigated by W. Burge 
[Proc. IFIP Congress (1971), 1, 454-459]. 

But what about rewind time? So far we have taken “initial runs processed” 
as the sole measure of efficiency for comparing tape merge strategies. But after 
each of phases 2 through 6 , in the examples at the beginning of this section, 
it is necessary for the computer to wait for two tapes to rewind; both the 
previous output tape and the new current output tape must be repositioned at 
the beginning, before the next phase can proceed. This can cause a significant 
delay, since the previous output tape generally contains a significant percentage 
of the records being sorted (see the “pass/phase” column in Table 1). It is 
a shame to have the computer twiddling its thumbs during all these rewind 
operations, since useful work could be done with the other tapes if we used a 
different merging pattern. 

A simple modification of the polyphase procedure will overcome this prob- 
lem, although it requires at least five tapes [see Y. Cesari, Thesis, U. of Paris 
(1968), 25-27, where the idea is credited to J. Caron], Each phase in Caron’s 
scheme merges runs from T — 3 tapes onto another tape, while the remaining 
two tapes are rewinding. 


280 SORTING 


5.4.2 


For example, consider the case of six tapes and 49 initial runs. In the 
following tableau, R denotes rewinding during the phase, and T5 is assumed to 
contain the original input: 


Phase 

T1 

T2 

T3 

T4 

T5 

T6 

Write time 

Rewind time 

1 

l 11 

l 17 

I 13 

l 8 

— 

(R) 

49 


17 

2 

(R) 

l 9 

l 5 

— 

R 

3 8 

8x3 = 

24 

49 - 17 = 32 

3 

l 6 

l 4 

— 

R 

3 5 

R 

5x3 = 

15 

max(8, 24) 

4 

l 2 

— 

R 

5 4 

R 

3 4 

4x5 = 

20 

max(13, 15) 

5 

— 

R 

7 2 

R 

3 3 

3 2 

2x7 = 

14 

max(17, 20) 

6 

R 

ll 2 

R 

5 2 

3 1 

— 

2 x 11 = 

22 

max(ll, 14) 

7 

15 1 

R 

7 1 

5 1 

— 

R 

1 x 15 = 

15 

max(22, 24) 

8 

R 

ll 1 

7° 


R 

23 1 

1 x 23 = 

23 

max(15, 15) 

9 

15 1 

ll 1 

— 

R 

33° 

R 

0 x 33 = 

0 

max(20, 23) 

10 

(15°) 

— 

R 

49 1 

(R) 

(23°) 

1 x 49 = 

49 

14 

Here all the rewind time 

is 

essentially overlapped, 

except in phase 9 

(a “dummy 


phase” that prepares for the final merge), and after the initial distribution phase 
(when all tapes are rewound). If t is the time to merge the number of records in 
one initial run, and if r is the time to rewind over one initial run, this process 
takes about 182t + 40r plus the time for initial distribution and final rewind. The 
corresponding figures for standard polyphase using Algorithm D are 140f + 104r, 
which is slightly worse when r = slightly better when r = 

Everything we have said about standard polyphase can be adapted to Caron’s 
polyphase; for example, the sequence a n now satisfies the recurrence 

a n —2 a n -3 T &n — 4 ( 20 ) 

instead of ( 3 ). The reader will find it instructive to analyze this method in the 
same way we analyzed standard polyphase, since it will enhance an understand- 
ing of both methods. (See, for example, exercises 19 and 20.) 

Table 5 gives statistics about Polyphase Caron that are analogous to the 
facts about Polyphase Ordinaire in Table 1. Notice that Caron’s method actually 
becomes superior to polyphase on eight or more tapes, in the number of runs 
processed as well as in the rewind time, even though it does (T — 3)-way merging 
instead of ( T — l)-way merging! 


Table 5 

APPROXIMATE BEHAVIOR OF CARON’S POLYPHASE MERGE SORTING 


Tapes 

Phases 

Passes 

Pass/phase 

Growth ratio 

5 

3.556 In S + 0.158 

1.463 In S + 1.016 

41% 

1.3247180 

6 

2.616 In 5 - 

0.166 

0.951 In S+ 1.014 

36% 

1.4655712 

7 

2.337 In 5 - 

0.472 

0.781 In S+ 1.001 

33% 

1.5341577 

8 

2.216 In S - 

0.762 

0.699 In S + 0.980 

32% 

1.5701473 

9 

2.156 In S - 

1.034 

0.654 In S + 0.954 

30% 

1.5900054 

10 

2.124 In S - 

1.290 

0.626 In S + 0.922 

29% 

1.6013473 

20 

2.078 In 5 - 

3.093 

0.575 In S + 0.524 

28% 

1.6179086 


5.4.2 


THE POLYPHASE MERGE 281 


This may seem paradoxical until we realize that a high order of merge does 
not necessarily imply an efficient sort. As an extreme example, consider placing 
one run on T1 and n runs on T2, T3, T4, T5; if we alternately do five-way 
merging to T6 and T1 until T2, T3, T4, T5 are empty, the processing time is 
(2 n 2 + 3n) initial run lengths, essentially proportional to S 2 instead of 5 log 5, 
although five- way merging was done throughout. 

Tape splitting. Efficient overlapping of rewind time is a problem that arises 
in many applications, not just sorting, and there is a general approach that can 
often be used. Consider an iterative process that uses two tapes in the following 
way: 



T1 

T2 

Phase 1 

Output 1 
Rewind 



Phase 2 

Input 1 
Rewind 

Output 2 
Rewind 

Phase 3 

Output 3 
Rewind 

Input 2 
Rewind 

Phase 4 

Input 3 
Rewind 

Output 4 
Rewind 


and so on, where “Output fc” means write the kth output file and “Input Ac” 
means read it. The rewind time can be avoided when three tapes are used, as 
suggested by C. Weisert [ CACM 5 (1962), 102]: 



T1 

T2 

T3 

Phase 1 

Output 1.1 
Output 1.2 
Rewind 

Output 1.3 

— 

Phase 2 

Input 1.1 
Input 1.2 
Rewind 

Output 2.1 
Rewind 

Input 1.3 

Output 2.2 
Output 2.3 

Phase 3 

Output 3.1 
Output 3.2 
Rewind 

Input 2.1 
Rewind 
Output 3.3 

Rewind 
Input 2.2 
Input 2.3 

Phase 4 

Input 3.1 
Input 3.2 
Rewind 

Output 4.1 
Rewind 

Input 3.3 

Rewind 
Output 4.2 
Output 4.3 


and so on. Here “Output k.j” means write the jth third of the fcth output 
file, and “Input k.j ” means read it. Virtually all of the rewind time will be 
eliminated if rewinding is at least twice as fast as the read/write speed. Such a 
procedure, in which the output of each phase is divided between tapes, is called 
“tape splitting.” 


282 SORTING 


5.4.2 


R. L. McAllester [CACM 7 (1964), 158-159] has shown that tape splitting 
leads to an efficient way of overlapping the rewind time in a polyphase merge. 
His method can be used with four or more tapes, and it does (T— 2)-way merging. 

Assuming once again that we have six tapes, let us try to design a merge 
pattern that operates as follows, splitting the output on each level, where “I” , 
“O”, and “R”, respectively, denote input, output, and rewinding: 


Level 

T1 

T2 

T3 

T4 

T5 

T6 

Number of runs output 

7 

I 

I 

I 

I 

R 

O 

u 7 


I 

I 

I 

I 

O 

R 

v 7 

6 

I 

I 

I 

R 

O 

I 

Uq 


I 

I 

I 

O 

R 

I 

V6 

5 

I 

I 

R 

O 

I 

I 

Us 


I 

I 

O 

R 

I 

I 

Vs 

4 

I 

R 

O 

I 

I 

I 

u 4 


I 

O 

R 

I 

I 

I 

v 4 

3 

R 

O 

I 

I 

I 

I 

U 3 


O 

R 

I 

I 

I 

I 

V 3 

2 

O 

I 

I 

I 

I 

R 

U2 


R 

I 

I 

I 

I 

O 

V2 

1 

I 

I 

I 

I 

R 

O 

U\ 


I 

I 

I 

I 

O 

R 

V\ 

0 

I 

I 

I 

R 

O 

I 

Uo 


I 

I 

I 

O 

R 

I 

VO 


In order to end with one run on T4 and all other tapes empty, we need to have 

vo = 1, 
u o + v\ = 0, 
ui + v 2 = u 0 + u 0 , 

u 2 + v 3 — Ui + Vi + Wo + v q, 

u 3 + V 4 = U 2 + V 2 + Ui + Vi + u 0 + v 0 , 

V -4 + V 5 = Uz + U 3 + U 2 + V 2 + Ui + Vi + M 0 + Vo, 

U5 + V 6 — U 4 + V 4 + «3 + V 3 + U 2 + V 2 + Ml + Vi , 

etc.; in general, the requirement is that 

Un + V n+ \ = 1 + U n _! + U n — 2 + V n — 2 + U„_ 3 + t)„_ 3 + U„_ 4 + U„_ 4 (22) 

for all n > 0, if we regard Uj = Vj = 0 for all j < 0. 

There is no unique solution to these equations; indeed, if we let all the u’s be 
zero, we get the usual polyphase merge with one tape wasted! But if we choose 
u n v n+x , the rewind time will be satisfactorily overlapped. 

McAllester suggested taking 

u n — v n —x + v n — 2 + u n _ 3 + u„_ 4 , 
v n + 1 — u n - 1 + W n _2 + M„_ 3 + M n _ 4, 


5.4.2 


THE POLYPHASE MERGE 283 


so that the sequence 

(xo, 1 ‘ •*'2i -Li ■ %4 ? *^5 ? • • • ) — (no , Uo 5 ^1 ? ^1 j C 2 . U 2 , . . . ) 

satisfies the uniform recurrence z n = :r n _ 3 + x „_ 5 + x n _ 7 + x„_ 9 . However, it 
turns out to be better to let 

v n+l = Mn -1 + V n -l + M n -2 + v n-2, , , 

( 2 3) 

^n — 3 "b ^n — 3 ~b ^n— 4 “b ^n— 4j 

this sequence not only leads to a slightly better merging time, it also has the 
great virtue that its merging time can be analyzed mathematically. McAllester’s 
choice is extremely difficult to analyze because runs of different lengths may 
occur during a single phase; we shall see that this does not happen with ( 23 ). 

We can deduce the number of runs on each tape on each level by working 
backwards in the pattern ( 21 ), and we obtain the following sorting scheme: 


Level 

T1 

T2 

T3 

T4 

T5 

T 6 

Write time Rewind time 


^23 

l 21 

l 17 

I 10 

— 

l 11 

82 

23 

7 

I 19 

l 17 

I 13 

l 6 

R 

1 13 4 4 

4 x 4 = 16 

82 - 23 


l 13 

l 11 

l 7 

— 

4 6 

R 

6 x 4 = 24 

27 

6 

l 10 

I s 

l 4 

R 

4 9 

1 8 4 4 

3 x 4 = 12 

10 


l 6 

l 4 

— 

4 4 

R 

1 4 4 4 

4 x 4 = 16 

36 

5 

l 5 

l 3 

R 

4 4 7 3 

4 8 

1 3 4 4 

1x7 = 7 

17 


l 2 

— 

7 3 

R 

4 5 

4 4 

3 x 7 = 21 

23 

4 

l 1 

R 

7 3 13 1 

4 3 7 4 

4 4 

4 3 

1 x 13 = 13 

21 


— 

13 1 

R 

4271 

4 3 

4 2 

1 x 13 = 13 

34 

3 

R 

isHq 1 

7 2 13 1 

4 X 7 X 

4 2 

4 1 

1 x 19 = 19 

23 


19 1 

R 

7H3 1 

7 1 

4 1 

— 

1 x 19 = 19 

32 

2 

19 1 31° 

13H9 1 

7H3 1 

7 1 

4 1 

R 

0 x 31 = 0 

27 


R 


13 1 

7° 

— 

31 1 

1 x 31 = 31 

19 

1 

W^l 0 


13 1 

7° 

R 

3U52 0 

0 x 52 = 0 ) 

| 


W^l 0 

19 1 

13 1 

— 

52° 

R 

0 x 52 = 0 

> max(36, 31, 23) 

0 

19 1 31° 

19 1 

13 1 

R 

52°82° 

31 1 52° 

0 x 82 = 0 J 

1 


(31°) 

(19°) 

— 

82 1 

(R) 

(31°52°) 

1 x 82 = 82 

0 


Unoverlapped rewinding occurs in three places: when the input tape T5 is being 
rewound (82 units), during the first half of the level 2 phase (27 units), and 
during the final “dummy merge” phases in levels 1 and 0 (36 units). So we may 
estimate the time as 273f + 145r; the corresponding amount for Algorithm D, 
268t + 208r, is almost always inferior. 

Exercise 23 proves that the run lengths output during each phase are suc- 
cessively 

4, 4, 7, 13, 19, 31, 52, 82, 133, . . . , ( 24 ) 

a sequence (ti,t 2 ,t 3 , . . .) satisfying the law 

tn ^n — 2 “b 2t n _ 3 -f- t n _ 4 (25) 

if we regard t n = 1 for n < 0. We can also analyze the optimum placement 
of dummy runs, by looking at strings of merge numbers as we did for standard 


284 

SORTING 






polyphase in Eq. (8): 





Final 

Level 

T1 

T2 

T3 

T4 

T6 

output on 

i 

1 

1 

1 

1 

— 

T5 

2 

1 

1 

1 

— 

1 

T4 

3 

21 

21 

2 

2 

1 

T3 

4 

2221 

222 

222 

22 

2 

T2 

5 

23222 

23222 

2322 

23 

222 

T1 

6 

333323222 

33332322 

333323 

3333 

2322 

T6 

n 

A n 

B n 

C n 

D n 

E n 

T(fc) ' 

77 + 1 

(KE n + l)B n 

(KE n + l)C n 

(A"£ n +l)D n 

KE n + 1 

A' 

T(fc-l) 


5 . 4.2 


where A n = A' n A' r [, and A" consists of the last u n merge numbers of A n . The rule 
above for going from level n to level n + 1 is valid for any scheme satisfying (22). 
When we define the u’s and v’s by (23), the strings A n , . . , , E n can be expressed 
in the following rather simple way analogous to (9): 

An = (W„_ 1 W n _ 2 W n _ 3 W n _ 4 ) + 1, 

B n = (W„_ 1 W n _ 2 W„_ 3 ) + 1, 

C n = (W n ^W n _ 2 ) + 1 , 

D n — (W„_i) + 1, 

En = (W„_ 2 W„_ 3 ) + 1, (27) 


(28) 


(29) 


where 

W n = (W n _ 3 W n _ 4 Vb n _ 2 W„_ 3 ) + 1 for n > 0 , 

Wo = 0 , and W„ — e for n < 0 . 

From these relations it is easy to make a detailed analysis of the six-tape case. 

In general, when there are T > 5 tapes, we let P — T — 2 , and we define the 
sequences (u n ), (v n ) by the rules 

1+4-1 ^n—i A + * ' * + n n _ r T u n _ r , 

— V’n — r—l T ^n — r— 1 T * * * T U n _p T V n ~Pt for 71 ^ 0 , 
where r = [P/ 2 j; Vo = 1 , and u„ — v n — 0 for n < 0 . So if w n = u n +v n , we have 
w n = w n - 2 + • • ■ + Mn-r + 2 u> n - r -i + W n - r -2 + ' ' ' + U) n -p, for 71 > 0 ; (30) 

wo = 1 ; and w n — 0 for n < 0 . The initial distribution on tapes for level 
7i+l places w n + w n _i + • • • + w n -p+k runs on tape k. for 1 < k < P, and 
w n _i + • • • + w n - r on tape T; tape T — 1 is used for input. Then u n runs are 
merged to tape T while T — 1 is being rewound; v n are merged to T — 1 while T 
is rewinding; u„_i to T — 1 while T — 2 is rewinding; etc. 

Table 6 shows the approximate behavior of this procedure when 5 is not too 
small. The “pass/phase” column indicates approximately how much of the entire 
file is being rewound during each half of a phase, and approximately how much 
of the file is being written during each full phase. The tape splitting method is 
superior to standard polyphase on six or more tapes , and probably also on five, 
at least for large S. 



5.4.2 


THE POLYPHASE MERGE 285 


Table 6 

APPROXIMATE BEHAVIOR OF POLYPHASE MERGE WITH TAPE SPLITTING 


Tapes 

Phases 

Passes 

Pass/phase 

Growth ratio 

4 

2.885 In S + 0.000 

1.443 In S+ 1.000 

50% 

1.4142136 

5 

2.078 In S + 0.232 

0.929 In S+ 1.022 

45% 

1.6180340 

6 

2.078 In S - 

0.170 

0.752 In S + 1.024 

36% 

1.6180340 

7 

1.958 In S - 

0.408 

0.670 In S+ 1.007 

34% 

1.6663019 

8 

2.008 In S - 

0.762 

0.624 In S + 0.994 

31% 

1.6454116 

9 

1 .972 In S - 

0.987 

0.595 In S + 0.967 

30% 

1.6604077 

10 

2.013 In S - 

1.300 

0.580 In S + 0.941 

29% 

1.6433803 

20 

2.069 In S - 

3.164 

0.566 In S + 0.536 

27% 

1.6214947 


When T = 4 the procedure above would become essentially equivalent to 
balanced two-way merging, without overlapping the rewind time, since w 2n +i 
would be 0 for all n. So the entries in Table 6 for T — 4 have been obtained by 
making a slight modification, letting v 2 = 0, ui = 1, V\ = 0, u 0 = 0, vq = 1* 
and v n+ x = u n _i + w n _i, u n — u „_ 2 + u„_2 for n > 2. This leads to a very 
interesting sorting scheme (see exercises 25 and 26). 

EXERCISES 

1. [16] Figure 69 shows the order in which runs 34 through 65 are distributed to five 
tapes with Algorithm D; in what order are runs 1 through 33 distributed? 

► 2. [21] True or false: After two merge phases in Algorithm D (that is, on the second 
time we reach step D6), all dummy runs have disappeared. 

► 3. [22] Prove that the condition D[l] > D[2] > • • • > D[T] is always satisfied at the 
conclusion of step D4. Explain why this condition is important, in the sense that the 
mechanism of steps D2 and D3 would not work properly otherwise. 

4. [M20] Derive the generating functions (7). 

5. [HM26] (E. P. Miles, Jr., 1960.) For all p > 2, prove that the polynomial f p {z) = 
z p — z p ~ 1 — ■ ■ ■ — z — 1 has p distinct roots, of which exactly one has magnitude greater 
than unity. [Hint: Consider the polynomial z p+1 — 2 z p + 1.] 

6. [HM24] The purpose of this exercise is to consider how Tables 1, 5, and 6 were 
prepared. Assume that we have a merging pattern whose properties are characterized 
by polynomials p(z) and q(z) in the following way: (i) The number of initial runs present 
in a “perfect distribution” requiring n merging phases is [z n ] p(z)/q(z). (ii) The number 
of initial runs processed during these n merging phases is [z n ]p(z) / q(z) 2 . (iii) There 
is a “dominant root” a of <?(z _1 ) such that q(a~ l ) = 0, q' (a -1 ) / 0, p(a~ x ) / 0, and 
q{fl~ 1 ) = 0 implies that f3 = a or |/3| < |a|. 

Prove that there is a number e > 0 such that, if S is the number of runs in a 
perfect distribution requiring n merging phases, and if pS initial runs are processed 
during those phases, we have n = alnS + b + 0(S~ e ) and p = clnS + d + 0(S~ e ), 
where , . v 

a = (lna) -1 , b= -a In ( P _ / ) - 1, c = a — Q , 

\-q'{a !)7 -g'(« _1 ) 

d _ (b+l)a-p'(a~ 1 )/p(g- 1 ) + q"(a~ 1 )/q'{a~ 1 ) 

— <j'(a -1 ) 


286 


SORTING 


5 . 4.2 


7. [HM22] Let a p be the dominant root of the polynomial f p (z) in exercise 5. What 
is the asymptotic behavior of a p as p — > oo? 

8. [M20] (E. Netto, 1901.) Let be the number of ways to express m as an 
ordered sum of the integers {1, 2, . . . ,p}. For example, when p = 3 and m = 5, there 
are 13 ways, namely l + l + l + l+l = 1 + 1 + 1+2 = 1+1+2+1 = 1 + 1+3 = 1+2+1 + 1 = 

1+2 + 2 = 1+3 + 1 = 2 + 1 + 1 + 1 = 2 + 1 + 2 = 2 + 2 + 1 = 2 + 3 = 3 + 1 + 1 = 3 + 2. 

Show that Nm is a generalized Fibonacci number. 

9. [M20] Let be the number of sequences of m Os and Is such that there are 
no p consecutive Is. For example, when p = 3 and m = 5 there are 24 such sequences: 
00000, 00001, 00010, 00011, 00100, 00101, 00110, 01000, 01001, . . . , 11011. Show that 
Km is a generalized Fibonacci number. 

10 . [M27] (Generalized Fibonacci number system .) Prove that every nonnegative 
integer n has a unique representation as a sum of distinct pth order Fibonacci numbers 
Fj , for j > p, subject to the condition that no p consecutive Fibonacci numbers are 
used. 


11. [M24] Prove that the nth element of the string Q x in (12) is equal to the number 
of distinct Fibonacci numbers in the fifth-order Fibonacci representation of n - 1. [See 
exercise 10.] 


► 12. [Ml 8] Find a connection between powers of the matrix 


the perfect Fibonacci distributions ir 


/° 

0 

0 

0 

1 


0 

0 

0 

1 

1/ 


and 


► 13. [22] Prove the following rather odd property of perfect Fibonacci distributions: 
When the final output will be on tape number T, the number of runs on each other 
tape is odd-, when the final output will be on some tape other than T, the number of 
runs will be odd on that tape, and it will be even on the others. [See (1).] 


14. [M35] Let T„(x) — T nk x k , where T n (x) is the polynomial defined in (16). 

a) Show that for each k there is a number n(k) such that T lk < T 2k < ■ ■■ < T n i fc)s . > 
r T(n(k) + l)k ^ * * * • 

b) Given that T n i k i < T nk * and n < n, prove that T n i k < T nk for all k > k'. 

c) Prove that there is a nondecreasing sequence ( M n ) such that E n (S) = min^iE^S) 
when M n < S < M n +i, but E n (S) > min,>i E j(S) when S > M n+1 . [See (19).] 

15. [M43] Prove or disprove: E„-i(m) < E„(m) implies that E n (m) < E n +i(m) < 
E„+2 (m) < ■ ■ ■ . [Such a result would greatly simplify the calculation of Table 2.] 


16. [HM43] Determine the asymptotic behavior of the polyphase merge with optimum 
distribution of dummy runs. 


17. [32] Prove or disprove: There is a way to disperse runs for an optimum polyphase 
distribution in such a way that the distribution for 5+1 initial runs is formed by 
adding one run (on an appropriate tape) to the distribution for S initial runs. 

18. [30] Does the optimum polyphase distribution produce the best possible merging 
pattern, in the sense that the total number of initial runs processed is minimized, if we 
insist that the initial runs be placed on at most T — 1 of the tapes? (Ignore rewind time.) 

19. [21] Make a table analogous to (1), for Caron’s polyphase sort on six tapes. 


5.4.2 


THE POLYPHASE MERGE 287 


20. [M24] What generating functions for Caron’s polyphase sort on six tapes corre- 
spond to (7) and to (16)? What relations, analogous to (9) and (27), define the strings 
of merge numbers? 

21 . [11] What should appear on level 7 in (26)? 

22. [M21] Each term of the sequence (24) is approximately equal to the sum of the 
previous two. Does this phenomenon hold for the remaining numbers of the sequence? 
Formulate and prove a theorem about t n — t n - 1 — t n - 2. 

► 23. [29] What changes would be made to (25), (27), and (28), if (23) were changed 
to — Un — l T Vn— 1 T Un— 2, — Vn — 2 d“ Un — 3 T ^n — 3 d - Un— 4 d - Vn— 4? 

24. [HM41] Compute the asymptotic behavior of the tape-splitting polyphase proce- 
dure, when u n+ i is defined to be the sum of the first q terms of u n - 1 d- ?J n -i + • • • + 
u n — p -I- v n -p , for various P = T — 2 and for 0 < q < 2 P. (The text treats only the 
case q — 2|_.P/2J; see exercise 23.) 

25 . [19] Show how the tape-splitting polyphase merge on four tapes, mentioned at 
the end of this section, would sort 32 initial runs. (Give a phase-by-phase analysis like 
the 82-run six-tape example in the text.) 

26 . [M21] Analyze the behavior of the tape-splitting polyphase merge on four tapes, 
when S = 2 n and when S = 2 n + 2 n_1 . (See exercise 25.) 

27 . [23] Once the initial runs have been distributed to tapes in a perfect distribution, 
the polyphase strategy is simply to “merge until empty”: We merge runs from all 
nonempty input tapes until one of them has been entirely read; then we use that tape 
as the next output tape, and let the previous output tape serve as an input. 

Does this merge-until-empty strategy always sort, no matter how the initial runs 
are distributed, as long as we distribute them onto at least two tapes? (One tape will, 
of course, be left empty so that it can be the first output tape.) 

28 . [M26] The previous exercise defines a rather large family of merging patterns. 
Show that polyphase is the best of them, in the following sense: If there are six tapes, 
and if we consider the class of all initial distributions (a, b, c, d , e) such that the merge- 
until-empty strategy requires at most n phases to sort, then a+b+c+d+e < t n , 
where t n is the corresponding value for polyphase sorting (l). 

29 . [M47] Exercise 28 shows that the polyphase distribution is optimal among all 
merge-until-empty patterns in the minimum-phase sense. But is it optimal also in the 
minimum-pass sense? 

Let a be relatively prime to b, and assume that a + b is the Fibonacci number F n . 
Prove or disprove the following conjecture due to R. M. Karp: The number of initial 
runs processed during the merge-until-empty pattern starting with distribution (a, b) 
is greater than or equal to ((n — 5)F n+1 4- (2 n + 2)F n )/b. (The latter figure is achieved 
when a = F n _ 1, b = F n - 2-) 

30. [42] Prepare a table analogous to Table 2, for the tape-splitting polyphase merge. 

31 . [M2 2] (R. Kemp.) Let Kd(n) be the number of n-node ordered trees in which 
every leaf is at distance d from the root. For example, A' 3 (8) = 7 because of the trees 


Show that Kd(n) is a generalized Fibonacci number, and find a one-to-one correspon- 
dence between such trees and the ordered partitions considered in exercise 8. 


288 


SORTING 


5.4.3 


*5.4.3. The Cascade Merge 

Another basic pattern, called the “cascade merge,” was actually discovered 
before polyphase [B. K. Betz and W. C. Carter, ACM National Meeting 14 
(1959), Paper 14]. This approach is illustrated for six tapes and 190 initial runs 
in the following table, using the notation developed in Section 5.4.2: 



             T1      T2      T3      T4      T5      T6     Initial runs processed

    Pass 1   1^55    1^50    1^41    1^29    1^15    —             190
    Pass 2   —       *1^5    2^9     3^12    4^14    5^15          190
    Pass 3   15^5    14^4    12^3    9^2     *5^1    —             190
    Pass 4   —       *15^1   29^1    41^1    50^1    55^1          190
    Pass 5   190^1   —       —       —       —       —             190

A cascade merge, like polyphase, starts out with a "perfect distribution" of
runs on tapes, although the rule for perfect distributions is somewhat different 
from those in Section 5.4.2. Each line in the table represents a complete pass 
over all the data. Pass 2, for example, is obtained by doing a five-way merge from {T1, T2, T3, T4, T5} to T6, until T5 is empty (this puts 15 runs of relative length 5 on T6), then a four-way merge from {T1, T2, T3, T4} to T5, then a three-way merge to T4, a two-way merge to T3, and finally a one-way merge (a copying operation) from T1 to T2. Pass 3 is obtained in the same way, first doing a five-way merge until one tape becomes empty, then a four-way merge, and so on. (Perhaps the present section of this book should be numbered 5.4.3.2.1 instead of 5.4.3!)

It is clear that the copying operations are unnecessary, and they could be 
omitted. Actually, however, in the six-tape case this copying takes only a small 
percentage of the total time. The items marked with an asterisk in the table 
above are those that were simply copied; only 25 of the 950 runs processed are 
of this type. Most of the time is devoted to five-way and four-way merging. 
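The pass structure just described is easy to check mechanically. Here is a minimal Python sketch (ours, not the book's; the name cascade_pass is made up) that simulates complete cascade passes on the 190-run perfect distribution and counts processed and copied runs:

    def cascade_pass(tapes):
        # One complete cascade pass: (T-1)-way, (T-2)-way, ..., 1-way merges.
        # tapes: lists of run lengths; exactly one tape is empty at the start.
        processed = copied = 0
        out = next(i for i, t in enumerate(tapes) if not t)
        active = [i for i, t in enumerate(tapes) if t and i != out]
        while active:
            # merge one run from every active tape until the shortest empties
            for _ in range(min(len(tapes[i]) for i in active)):
                length = sum(tapes[i].pop() for i in active)
                tapes[out].append(length)
                processed += length
                if len(active) == 1:      # a one-way merge is just a copy
                    copied += length
            emptied = next(i for i in active if not tapes[i])
            active.remove(emptied)
            out = emptied                 # the emptied tape receives next
        return processed, copied

    tapes = [[1] * c for c in (55, 50, 41, 29, 15, 0)]
    processed, copied = 190, 0            # count the distribution as one pass
    while sum(1 for t in tapes if t) > 1:
        p, c = cascade_pass(tapes)
        processed, copied = processed + p, copied + c
    print(processed, copied)              # 950 and 25, as claimed above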


Table 1
APPROXIMATE BEHAVIOR OF CASCADE MERGE SORTING

    Tapes   Passes (with copying)    Passes (without copying)   Growth ratio
      3     2.078 ln S + 0.672       1.504 ln S + 0.992          1.6180340
      4     1.235 ln S + 0.754       1.102 ln S + 0.820          2.2469796
      5     0.946 ln S + 0.796       0.897 ln S + 0.800          2.8793852
      6     0.796 ln S + 0.821       0.773 ln S + 0.808          3.5133371
      7     0.703 ln S + 0.839       0.691 ln S + 0.822          4.1481149
      8     0.639 ln S + 0.852       0.632 ln S + 0.834          4.7833861
      9     0.592 ln S + 0.861       0.587 ln S + 0.845          5.4189757
     10     0.555 ln S + 0.869       0.552 ln S + 0.854          6.0547828
     20     0.397 ln S + 0.905       0.397 ln S + 0.901         12.4174426


At first it may seem that the cascade pattern is a rather poor choice, by comparison with polyphase, since standard polyphase uses (T-1)-way merging throughout while the cascade uses (T-1)-way, (T-2)-way, (T-3)-way, etc. But in fact it is asymptotically better than polyphase, on six or more tapes! As we have observed in Section 5.4.2, a high order of merge is not a guarantee of efficiency. Table 1 shows the performance characteristics of cascade merge, by analogy with the similar tables in Section 5.4.2.

The "perfect distributions" for a cascade merge are easily derived by working backwards from the final state (1, 0, ..., 0). With six tapes, they are


    Level    T1     T2     T3     T4     T5
      0       1      0      0      0      0
      1       1      1      1      1      1
      2       5      4      3      2      1
      3      15     14     12      9      5
      4      55     50     41     29     15
      5     190    175    146    105     55
      n      a_n    b_n    c_n    d_n    e_n
     n+1    a_n+b_n+c_n+d_n+e_n   a_n+b_n+c_n+d_n   a_n+b_n+c_n   a_n+b_n   a_n        (1)


It is interesting to note that the relative magnitudes of these numbers appear also in the diagonals of a regular (2T-1)-sided polygon. For example, the five diagonals in the hendecagon of Fig. 73 have relative lengths very nearly equal to 190, 175, 146, 105, and 55! We shall prove this remarkable fact later in this section, and we shall also see that the relative amount of time spent in (T-1)-way merging, (T-2)-way merging, ..., 1-way merging is approximately proportional to the squares of the lengths of these diagonals.
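The diagonal claim can be checked numerically. In this little sketch (ours, not the text's), the diagonals of a regular hendecagon inscribed in a unit circle have lengths 2 sin(kπ/11), and the level-5 cascade numbers are very nearly proportional to them:

    import math

    level5 = [190, 175, 146, 105, 55]     # six-tape cascade numbers, level 5
    diagonals = [2 * math.sin(k * math.pi / 11) for k in (5, 4, 3, 2, 1)]
    for runs, d in zip(level5, diagonals):
        print(round(runs / level5[-1], 4), round(d / diagonals[-1], 4))
    # 3.4545 vs 3.5133, 3.1818 vs 3.2287, 2.6545 vs 2.6825, 1.9091 vs 1.9190;
    # the agreement improves level by level, as proved later in this section.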



Fig. 73. Geometrical interpretation of cascade numbers. 

Initial distribution of runs. When the actual number of initial runs isn't perfect, we can insert dummy runs as usual. A superficial analysis of this situation would indicate that the method of dummy run assignment is immaterial,





Fig. 74. Efficiency of cascade merge with the distribution of Algorithm D. 


since cascade merging operates by complete passes; if we have 190 initial runs, each record is processed five times as in the example above, but if there are 191 we must apparently go up a level so that every record is processed six times. Fortunately this abrupt change is not actually necessary; David E. Ferguson has found a way to distribute initial runs so that many of the operations during the first merge pass reduce to copying the contents of a tape. When such copying operations are bypassed (by simply changing "logical" tape unit numbers relative to the "physical" numbers as in Algorithm 5.4.2D), we obtain a relatively smooth transition from level to level, as shown in Fig. 74.

Suppose that (a, b, c, d, e) is a perfect distribution, where a ≥ b ≥ c ≥ d ≥ e. By redefining the correspondence between logical and physical tape units, we can imagine that the distribution is actually (e, d, c, b, a), with a runs on T5, b on T4, etc. The next perfect distribution is (a+b+c+d+e, a+b+c+d, a+b+c, a+b, a); and if the input is exhausted before we reach this next level, let us assume that the tapes contain, respectively, (D_1, D_2, D_3, D_4, D_5) dummy runs, where

    D_1 ≤ a+b+c+d,  D_2 ≤ a+b+c,  D_3 ≤ a+b,  D_4 ≤ a,  D_5 = 0;
    D_1 ≥ D_2 ≥ D_3 ≥ D_4 ≥ D_5.        (2)





We are free to imagine that the dummy runs appear in any convenient place on the tapes. The first merge pass is supposed to produce a runs by five-way merging, then b by four-way merging, etc., and our goal is to arrange the dummies so as to replace merging by copying. It is convenient to do the first merge pass as follows:

1. If D_4 = a, subtract a from each of D_1, D_2, D_3, D_4 and pretend that T5 is the result of the merge. If D_4 < a, merge a runs from tapes T1 through T5, using the minimum possible number of dummies on tapes T1 through T5 so that the new values of D_1, D_2, D_3, D_4 will satisfy

    D_1 ≤ b+c+d,  D_2 ≤ b+c,  D_3 ≤ b,  D_4 = 0;
    D_1 ≥ D_2 ≥ D_3 ≥ D_4.        (3)

Thus, if D_2 was originally ≤ b+c, we use no dummies from it at this step, while if b+c < D_2 ≤ a+b+c we use exactly D_2 - b - c of them.

2. (This step is similar to step 1, but "shifted.") If D_3 = b, subtract b from each of D_1, D_2, D_3 and pretend that T4 is the result of the merge. If D_3 < b, merge b runs from tapes T1 through T4, reducing the number of dummies if necessary in order to make

    D_1 ≤ c+d,  D_2 ≤ c,  D_3 = 0;  D_1 ≥ D_2 ≥ D_3.

3. And so on. 


Table 2
EXAMPLE OF CASCADE DISTRIBUTION STEPS

    Step     Add to T1   Add to T2   Add to T3   Add to T4   Add to T5   "Amount saved"
    (1,1)        9           0           0           0           0       15+14+12+5
    (2,2)        3          12           0           0           0       15+14+9+5
    (2,1)        9           0           0           0           0       15+14+5
    (3,3)        2           2          14           0           0       15+12+5
    (3,2)        3          12           0           0           0       15+9+5
    (3,1)        9           0           0           0           0       15+5
    (4,4)        1           1           1          15           0       14+5
    (4,3)        2           2          14           0           0       12+5
    (4,2)        3          12           0           0           0       9+5
    (4,1)        9           0           0           0           0       5


Ferguson's method of distributing runs to tapes can be illustrated by considering the process of going from level 3 to level 4 in (1). Assume that "logical" tapes (T1, ..., T5) contain respectively (5, 9, 12, 14, 15) runs and that we want eventually to bring this up to (55, 50, 41, 29, 15). The procedure can be summarized as shown in Table 2. We first put nine runs on T1, then (3, 12) on T1 and T2, etc. If the input becomes exhausted during, say, Step (3,2), then the "amount saved" is 15 + 9 + 5, meaning that the five-way merge of 15 runs, the two-way merge of 9 runs, and the one-way merge of 5 runs are avoided by the dummy run assignment. In other words, 15 + 9 + 5 of the runs present at level 3 are not processed during the first merge phase.
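The steps of Table 2 follow a simple rule that will reappear in steps C4 and C6 of Algorithm C below: Step (i,j) first writes A[T-j-1] runs on logical tape j, then A[T-j-1] - A[T-j] runs on each of tapes j-1, ..., 1, where A[1] ≥ A[2] ≥ ... is the perfect distribution most recently reached. A minimal Python sketch (ours; the name ferguson_steps is made up) reproduces the table:

    def ferguson_steps(A, T):
        # A is 1-indexed: A[1] >= ... >= A[T-1] is the level just reached.
        for i in range(1, T - 1):          # sublevels 1, 2, ..., T-2
            for j in range(i, 0, -1):      # steps (i,i), (i,i-1), ..., (i,1)
                add = [0] * T
                for k in range(j, 0, -1):
                    add[k] = A[T-j-1] if k == j else A[T-j-1] - A[T-j]
                yield (i, j), add[1:]

    A = [None, 15, 14, 12, 9, 5]           # level 3, as in Table 2
    for step, add in ferguson_steps(A, T=6):
        print(step, add)
    # (1, 1) [9, 0, 0, 0, 0]
    # (2, 2) [3, 12, 0, 0, 0]
    #  ...
    # (4, 4) [1, 1, 1, 15, 0]    -- matching Table 2 row by row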




The following algorithm defines the process in detail. 


Algorithm C (Cascade merge sorting with special distribution). This algorithm takes initial runs and disperses them to tapes, one run at a time, until the supply of initial runs is exhausted. Then it specifies how the tapes are to be merged, assuming that there are T ≥ 3 available tape units, using at most (T-1)-way merging and avoiding unnecessary one-way merging. Tape T may be used to hold the input, since it does not receive any initial runs. The following tables are maintained:


A[j], 1 ≤ j ≤ T:      The perfect cascade distribution we have most recently reached.

AA[j], 1 ≤ j ≤ T:     The perfect cascade distribution we are striving for.

D[j], 1 ≤ j ≤ T:      Number of dummy runs assumed to be present on logical tape unit number j.

M[j], 1 ≤ j ≤ T:      Maximum number of dummy runs desired on logical tape unit number j.

TAPE[j], 1 ≤ j ≤ T:   Number of the physical tape unit corresponding to logical tape unit number j.

C1. [Initialize.] Set A[k] ← AA[k] ← D[k] ← 0 for 2 ≤ k ≤ T; and set A[1] ← 0, AA[1] ← 1, D[1] ← 1. Set TAPE[k] ← k for 1 ≤ k ≤ T. Finally set i ← T-2, j ← 1, k ← 1, l ← 0, m ← 1, and go to step C5. (This maneuvering is one way to get everything started, by jumping right into the inner loop with appropriate settings of the control variables.)

C2. [Begin new level.] (We have just reached a perfect distribution, and since there is more input we must get ready for the next level.) Increase l by 1. Set A[k] ← AA[k], for 1 ≤ k ≤ T; then set AA[T-k] ← AA[T-k+1] + A[k], for k = 1, 2, ..., T-1 in this order. Set (TAPE[1], ..., TAPE[T-1]) ← (TAPE[T-1], ..., TAPE[1]), and set D[k] ← AA[k+1] for 1 ≤ k < T. Finally set i ← 1.

C3. [Begin ith sublevel.] Set j ← i. (The variables i and j represent "Step (i,j)" in the example shown in Table 2.)

C4. [Begin Step (i,j).] Set k ← j and m ← A[T-j-1]. If m = 0 and i = j, set i ← T-2 and return to C3; if m = 0 and i ≠ j, return to C2. (Variable m represents the number of runs to be written onto TAPE[k]; m = 0 occurs only when l = 1.)

C5. [Input to TAPE[k].] Write one run on tape number TAPE[k], and decrease D[k] by 1. Then if the input is exhausted, rewind all the tapes and go to step C7.

C6. [Advance.] Decrease m by 1. If m > 0, return to C5. Otherwise decrease k by 1; if k > 0, set m ← A[T-j-1] - A[T-j] and return to C5 if m > 0. Otherwise decrease j by 1; if j > 0, go to C4. Otherwise increase i by 1; if i < T-1, return to C3. Otherwise go to C2.





Fig. 75. The cascade merge, with special distribution. 

C7. [Prepare to merge.] (At this point the initial distribution is complete, and the AA, D, and TAPE tables describe the present states of the tapes.) Set M[k] ← AA[k+1] for 1 ≤ k < T, and set FIRST ← 1. (Variable FIRST is nonzero only during the first merge pass.)

C8. [Cascade.] If l = 0, stop; sorting is complete and the output is on TAPE[1]. Otherwise, for p = T-1, T-2, ..., 1, in this order, do a p-way merge from TAPE[1], ..., TAPE[p] to TAPE[p+1] as follows:

If p = 1, simulate the one-way merge by simply rewinding TAPE[2], then interchanging TAPE[1] ↔ TAPE[2].

Otherwise if FIRST = 1 and D[p-1] = M[p-1], simulate the p-way merge by simply interchanging TAPE[p] ↔ TAPE[p+1], rewinding TAPE[p], and subtracting M[p-1] from each of D[1], ..., D[p-1], M[1], ..., M[p-1].

Otherwise, subtract M[p-1] from each of M[1], ..., M[p-1]. Then merge one run from each TAPE[j] such that 1 ≤ j ≤ p and D[j] ≤ M[j]; subtract one from each D[j] such that 1 ≤ j ≤ p and D[j] > M[j]; and put the output run on TAPE[p+1]. Continue doing this until TAPE[p] is empty. Then rewind TAPE[p] and TAPE[p+1].

C9. [Down a level.] Decrease l by 1, set FIRST ← 0, and set (TAPE[1], ..., TAPE[T]) ← (TAPE[T], ..., TAPE[1]). (At this point all D's and M's are zero and will remain so.) Return to C8.

Steps C1-C6 of this algorithm do the distribution, and steps C7-C9 do the merging; the two parts are fairly independent of each other, and it would be possible to store M[k] and AA[k+1] in the same memory locations.









Analysis of cascade merging. The cascade merge is somewhat harder to 
analyze than polyphase, but the analysis is especially interesting because so 
many remarkable formulas are present. Readers who enjoy discrete mathematics 
are urged to study the cascade distribution for themselves, before reading further, 
since the numbers have extraordinary properties that are a pleasure to discover. 
We shall discuss here one of the many ways to approach the analysis, emphasizing 
the way in which the results might be discovered. 

For convenience, let us consider the six-tape case, looking for formulas that 
generalize to all T. Relations (1) lead to the first basic pattern: 


    a_n = a_n           = \binom{0}{0} a_n,
    b_n = a_n - e_{n-1} = \binom{1}{0} a_n - \binom{2}{2} a_{n-2},
    c_n = b_n - d_{n-1} = \binom{2}{0} a_n - \binom{3}{2} a_{n-2} + \binom{4}{4} a_{n-4},        (4)
    d_n = c_n - c_{n-1} = \binom{3}{0} a_n - \binom{4}{2} a_{n-2} + \binom{5}{4} a_{n-4} - \binom{6}{6} a_{n-6},
    e_n = d_n - b_{n-1} = \binom{4}{0} a_n - \binom{5}{2} a_{n-2} + \binom{6}{4} a_{n-4} - \binom{7}{6} a_{n-6} + \binom{8}{8} a_{n-8}.

Let A(z) = \sum_{n\ge0} a_n z^n, ..., E(z) = \sum_{n\ge0} e_n z^n, and define the polynomials

    q_m(z) = \sum_{k=0}^{m} (-1)^k \binom{m+k}{2k} z^{2k}.        (5)

The result of (4) can be summarized by saying that the generating functions B(z) - q_1(z)A(z), C(z) - q_2(z)A(z), D(z) - q_3(z)A(z), and E(z) - q_4(z)A(z) reduce to finite sums, corresponding to the values of a_{-1}, a_{-2}, a_{-3}, ... that appear in (4) for small n but do not appear in A(z). In order to supply appropriate boundary conditions, let us run the recurrence backwards to negative levels, through level -8:


      n     a_n    b_n    c_n    d_n    e_n
      0      1      0      0      0      0
     -1      0      0      0      0      1
     -2      1     -1      0      0      0
     -3      0      0      0     -1      2
     -4      2     -3      1      0      0
     -5      0      0      1     -4      5
     -6      5     -9      5     -1      0
     -7      0     -1      6    -14     14
     -8     14    -28     20     -7      1




(On seven tapes the table would be similar, with entries for odd n shifted right one column.) The sequence a_0, a_{-2}, a_{-4}, ... = 1, 1, 2, 5, 14, ... is a dead giveaway for computer scientists, since it occurs in connection with so many recursive algorithms (see, for example, exercise 2.2.1-4 and Eq. 2.3.4.4-(14)); therefore we conjecture that in the T-tape case


    a_{-2n}   = \binom{2n}{n} \frac{1}{n+1},   for 0 ≤ n ≤ T-2;
    a_{-2n-1} = 0,                             for 0 ≤ n ≤ T-3.        (6)


To verify that this choice is correct, it suffices to show that (6) and (4) yield the correct results for levels 0 and 1. On level 1 this is obvious, and on level 0 we have to verify that

    \binom{m}{0} a_0 - \binom{m+1}{2} a_{-2} + \binom{m+2}{4} a_{-4} - \binom{m+3}{6} a_{-6} + \cdots
        = \sum_{k\ge0} \binom{m+k}{2k} (-1)^k \binom{2k}{k} \frac{1}{k+1} = \delta_{m0}        (7)

for 0 ≤ m ≤ T-2. Fortunately this sum can be evaluated by standard techniques; it is, in fact, Example 2 in Section 1.2.6.

Now we can compute the coefficients of B(z) - q_1(z)A(z), etc. For example, consider the coefficient of z^{2m} in D(z) - q_3(z)A(z): It is

    \sum_{k\ge0} \binom{3+m+k}{2m+2k} (-1)^{m+k} a_{-2k}
        = \sum_{k\ge0} \binom{3+m+k}{2m+2k} \binom{2k}{k} \frac{(-1)^{m+k}}{k+1}
        = (-1)^{m+1} \binom{2+m}{2m},

by the result of Example 3 in Section 1.2.6. Therefore we have deduced that

    A(z) = q_0(z)A(z),
    B(z) = q_1(z)A(z) - q_0(z),        C(z) = q_2(z)A(z) - q_1(z),        (8)
    D(z) = q_3(z)A(z) - q_2(z),        E(z) = q_4(z)A(z) - q_3(z).

Furthermore we have e_{n+1} = a_n; hence zA(z) = E(z), and

    A(z) = q_3(z) / (q_4(z) - z).        (9)

The generating functions have now been derived in terms of the q polynomials, and so we want to understand the q's better. Exercise 1.2.9-15 is useful in this regard, since it gives us a closed form that may be written

    q_m(z) = \frac{((\sqrt{4-z^2} + iz)/2)^{2m+1} + ((\sqrt{4-z^2} - iz)/2)^{2m+1}}{\sqrt{4-z^2}}.        (10)

Everything simplifies if we now set z = 2 sin θ:

    q_m(2 sin θ) = \frac{(cos θ + i sin θ)^{2m+1} + (cos θ - i sin θ)^{2m+1}}{2 cos θ} = \frac{cos(2m+1)θ}{cos θ}.        (11)

(This coincidence leads us to suspect that the polynomial q_m(z) is well known in mathematics; and indeed, a glance at appropriate tables will show that q_m(z) is essentially a Chebyshev polynomial of the second kind, namely (-1)^m U_{2m}(z/2) in conventional notation.)

We can now determine the roots of the denominator in (9): The equation q_4(2 sin θ) = 2 sin θ reduces to

    cos 9θ = 2 sin θ cos θ = sin 2θ.

We can obtain solutions to this relation whenever ±9θ = 2θ + (2n - 1/2)π; and all such θ yield roots of the denominator in (9) provided that cos θ ≠ 0. (When cos θ = 0, q_m(±2) = ±(2m+1) is never equal to ±2.) The following eight distinct roots for q_4(z) - z = 0 are therefore obtained:

    2 sin(-π/14), 2 sin(3π/14), 2 sin(-5π/14);
    2 sin(π/22), 2 sin(-3π/22), 2 sin(5π/22), 2 sin(-7π/22), 2 sin(9π/22).

Since q_4(z) is a polynomial of degree 8, this accounts for all the roots. The first three of these values make q_3(z) = 0, so q_3(z) and q_4(z) - z have a polynomial of degree three as a common factor. The other five roots govern the asymptotic behavior of the coefficients of A(z), if we expand (9) in partial fractions.

Considering the general T-tape case, let θ_k = (4k+1)π/(4T-2). The generating function A(z) for the T-tape cascade distribution numbers takes the form

    A(z) = \frac{4}{2T-1} \sum_{-T/2 < k \le \lfloor T/2 \rfloor} \frac{\cos^2 θ_k}{1 - z/(2 \sin θ_k)}        (12)

(see exercise 8); hence

    a_n = \frac{4}{2T-1} \sum_{-T/2 < k \le \lfloor T/2 \rfloor} \cos^2 θ_k \left(\frac{1}{2 \sin θ_k}\right)^n.        (13)

The equations in (8) now lead to the similar formulas

    b_n = \frac{4}{2T-1} \sum_{-T/2 < k \le \lfloor T/2 \rfloor} \cos θ_k \cos 3θ_k \left(\frac{1}{2 \sin θ_k}\right)^n,

    c_n = \frac{4}{2T-1} \sum_{-T/2 < k \le \lfloor T/2 \rfloor} \cos θ_k \cos 5θ_k \left(\frac{1}{2 \sin θ_k}\right)^n,        (14)

    d_n = \frac{4}{2T-1} \sum_{-T/2 < k \le \lfloor T/2 \rfloor} \cos θ_k \cos 7θ_k \left(\frac{1}{2 \sin θ_k}\right)^n,


and so on. Exercise 9 shows that these equations hold for all n ≥ 0, not only for large n. In each sum the term for k = 0 dominates all the others, especially




when n is reasonably large; therefore the "growth ratio" is

    \frac{1}{2 \sin θ_0} = \frac{2}{\pi} T - \frac{1}{\pi} + \frac{\pi}{48T} + O(T^{-2}).        (15)
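The exact growth ratio 1/(2 sin θ_0) and the expansion (15) are easy to tabulate; this sketch (ours) reproduces the last column of Table 1:

    import math

    for T in (3, 4, 5, 6, 7, 8, 9, 10, 20):
        theta0 = math.pi / (4*T - 2)
        exact = 1 / (2 * math.sin(theta0))                   # growth ratio
        approx = 2*T/math.pi - 1/math.pi + math.pi/(48*T)    # expansion (15)
        print(f"{T:2d}  {exact:10.7f}  {approx:10.7f}")
    # T = 3 gives 1.6180340 (the golden ratio); T = 6 gives 3.5133371; etc.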


Cascade sorting was first analyzed by W. C. Carter [Proc. IFIP Congress (1962), 62-66], who obtained numerical results for small T, and by David E. Ferguson [see CACM 7 (1964), 297], who discovered the first two terms in the asymptotic behavior (15) of the growth ratio. During the summer of 1964, R. W. Floyd discovered the explicit form 1/(2 sin θ_0) of the growth ratio, so that exact formulas could be used for all T. An intensive analysis of the cascade numbers was independently carried out by G. N. Raney [Canadian J. Math. 18 (1966), 332-349], who came across them in quite another way having nothing to do with sorting. Raney observed the "ratio of diagonals" principle of Fig. 73, and derived many other interesting properties of the numbers. Floyd and Raney used matrix manipulations in their proofs (see exercise 6).


Modifications of cascade sorting. If one more tape is added, it is possible to overlap nearly all of the rewind time during a cascade sort. For example, we can merge T1-T5 to T7, then T1-T4 to T6, then T1-T3 to T5 (which by now is rewound), then T1-T2 to T4, and the next pass can begin when the comparatively short data on T4 has been rewound. The efficiency of this process can be predicted from the analysis of cascading. (See Section 5.4.6 for further information.)

A "compromise merge" scheme, which includes both polyphase and cascade as special cases, was suggested by D. E. Knuth in CACM 6 (1963), 585-587. Each phase consists of (T-1)-way, (T-2)-way, ..., P-way merges, where P is any fixed number between 1 and T-1. When P = T-1, this is polyphase, and when P = 1 it is pure cascade; when P = 2 it is cascade without copy phases. Analyses of this scheme have been made by C. E. Radke [IBM Systems J. 5 (1966), 226-247] and by W. H. Burge [Proc. IFIP Congress (1971), 1, 454-459]. Burge found the generating function \sum T_n(x) z^n for each (P, T) compromise merge, generalizing Eq. 5.4.2-(16); he showed that the best value of P, from the standpoint of fewest initial runs processed as a function of S as S → ∞ (using a straightforward distribution scheme and ignoring rewind time), is respectively (2, 3, 3, 4, 4, 4, 3, 3, 4) for T = (3, 4, 5, 6, 7, 8, 9, 10, 11). These values of P lean more towards cascade than polyphase as T increases; and it turns out that the compromise merge is never substantially better than cascade itself. On the other hand, with an optimum choice of levels and optimum distribution of dummy runs, as described in Section 5.4.2, pure polyphase seems to be best of all the compromise merges; unfortunately the optimum distribution is comparatively difficult to implement.

Th. L. Johnsen [BIT 6 (1966), 129-143] has studied a combination of balanced and polyphase merging; a rewind-overlap variation of balanced merging has been proposed by M. A. Goetz [Digital Computer User's Handbook, edited by M. Klerer and G. A. Korn (New York: McGraw-Hill, 1967), 1.311-1.312]; and many other hybrid schemes can be imagined.




EXERCISES 

1. [10] Using Table 1, compare cascade merging with the tape-splitting version of 
polyphase described in Section 5.4.2. Which is better? (Ignore rewind time.) 

► 2. [22] Compare cascade sorting on three tapes, using Algorithm C, to polyphase 
sorting on three tapes, using Algorithm 5.4.2D. What similarities and differences can 
you find? 

3. [23] Prepare a table that shows what happens when 100 initial runs are sorted 
on six tapes using Algorithm C. 

4. [M20] (G. N. Raney.) An "nth level cascade distribution" is a multiset defined as follows (in the case of six tapes): {1, 0, 0, 0, 0} is a 0th level cascade distribution; and if {a, b, c, d, e} is an nth level cascade distribution, {a+b+c+d+e, a+b+c+d, a+b+c, a+b, a} is an (n+1)st level cascade distribution. (A multiset is unordered, hence up to 5! different (n+1)st level distributions can be formed from a single nth level distribution.)

a) Prove that any multiset {a, b, c, d, e} of relatively prime integers is an nth level cascade distribution, for some n.

b) Prove that the distribution defined for cascade sorting is optimum, in the sense that, if {a, b, c, d, e} is any nth level distribution with a ≥ b ≥ c ≥ d ≥ e, we have a ≤ a_n, b ≤ b_n, c ≤ c_n, d ≤ d_n, e ≤ e_n, where (a_n, b_n, c_n, d_n, e_n) is the distribution defined in (1).

► 5. [20] Prove that the cascade numbers defined in (1) satisfy the law

    a_k a_{n-k} + b_k b_{n-k} + c_k c_{n-k} + d_k d_{n-k} + e_k e_{n-k} = a_n,   for 0 ≤ k ≤ n.

[Hint: Interpret this relation by considering how many runs of various lengths are output during the kth pass of a complete cascade sort.]

6. [M20] Find a 5 × 5 matrix Q such that the first row of Q^n contains the six-tape cascade numbers a_n, b_n, c_n, d_n, e_n for all n ≥ 0.

7. [M20] Given that cascade merge is being applied to a perfect distribution of a_n initial runs, find a formula for the amount of processing saved when one-way merging is suppressed.

8. [HM23] Derive (12). 

9. [HM26] Derive (14). 

► 10. [M28] Instead of using the pattern (4) to begin the study of the cascade numbers, start with the identities

    e_n = a_{n-1} = \binom{1}{1} a_{n-1},
    d_n = 2a_{n-1} - e_{n-2} = \binom{2}{1} a_{n-1} - \binom{3}{3} a_{n-3},
    c_n = 3a_{n-1} - d_{n-2} - 2e_{n-2} = \binom{3}{1} a_{n-1} - \binom{4}{3} a_{n-3} + \binom{5}{5} a_{n-5},

etc. Letting

    r_m(z) = \binom{m}{1} z - \binom{m+1}{3} z^3 + \binom{m+2}{5} z^5 - \cdots,

express A(z), B(z), etc. in terms of these r polynomials.

11. [M38] Let

    f_m(z) = \sum_k \binom{\lfloor (m+k)/2 \rfloor}{k} (-1)^{\lceil k/2 \rceil} z^k.

Prove that the generating function A(z) for the T-tape cascade numbers is equal to f_{T-3}(z) / f_{T-1}(z), where the numerator and denominator in this expression have no common factor.

12. [M40] Prove that Ferguson's distribution scheme is optimum, in the sense that no method of placing the dummy runs, satisfying (2), will cause fewer initial runs to be processed during the first pass, provided that the strategy of steps C7-C9 is used during this pass.

13 . [40] The text suggests overlapping most of the rewind time, by adding an extra 
tape. Explore this idea. (For example, the text’s scheme involves waiting for T4 to 
rewind; would it be better to omit T4 from the first merge phase of the next pass?) 

*5.4.4. Reading Tape Backwards 

Many magnetic tape units have the ability to read tape in the opposite direction 
from which it was written. The merging patterns we have encountered so far 
always write information onto tape in the “forward” direction, then rewind the 
tape, read it forwards, and rewind again. The tape files therefore behave as 
queues, operating in a first-in-first-out manner. Backwards reading allows us to 
eliminate both of these rewind operations: We write the tape forwards and read 
it backwards. In this case the files behave as stacks, since they are used in a 
last-in-first-out manner. 

The balanced, polyphase, and cascade merge patterns can all be adapted to 
backward reading. The main difference is that merging reverses the order of the 
runs when we read backwards and write forwards. If two runs are in ascending 
order on tape, we can merge them while reading backwards, but this produces 
descending order. The descending runs produced in this way will subsequently 
become ascending on the next pass; so the merging algorithms must be capable 
of dealing with runs in either order. Programmers who are confronted with 
read-backwards for the first time often feel like they are standing on their heads! 

As an example of backwards reading, consider the process of merging 8 initial 
runs, using a balanced merge on four tapes. The operations can be summarized 
as follows: 



             T1          T2          T3      T4
    Pass 1   A1A1A1A1    A1A1A1A1    —       —      Initial distribution
    Pass 2   —           —           D2D2    D2D2   Merge to T3 and T4
    Pass 3   A4          A4          —       —      Merge to T1 and T2
    Pass 4   —           —           D8      —      Final merge to T3


Here A_r stands for a run of relative length r that appears on tape in ascending order, if the tape is read forwards as in our previous examples; D_r is the corresponding notation for a descending run of length r. During Pass 2 the
ascending runs become descending: They appear to be descending in the input, 
since we are reading T1 and T2 backwards. Then the runs switch orientation 
again on Pass 3. 

Notice that the process above finishes with the result on tape T3, in de- 
scending order. If this is bad (depending on whether the output is to be read 




backwards, or to be dismounted and put away for future use), we could copy it to another tape, reversing the direction. A faster way would be to rewind T1 and T2 after Pass 3, producing A_8 during Pass 4. Still faster would be to start with eight descending runs during Pass 1, since this would interchange all the A's and D's. However, the balanced merge on 16 initial runs would require the initial runs to be ascending; and we usually don't know in advance how many initial runs will be formed, so it is necessary to choose one consistent direction. Therefore the idea of rewinding after Pass 3 is probably best.

The cascade merge carries over in the same way. For example, consider 
sorting 14 initial runs on four tapes: 



             T1               T2             T3         T4
    Pass 1   A1A1A1A1A1A1     A1A1A1A1A1     A1A1A1     —
    Pass 2   —                D1             D2D2       D3D3D3
    Pass 3   A6               A5             A3         —
    Pass 4   —                —              —          D14


Again, we could produce A_14 instead of D_14, if we rewound T1, T2, T3 just before the final pass. This tableau illustrates a "pure" cascade merge, in the sense that all of the one-way merges have been performed explicitly. If we had suppressed the copying operations, as in Algorithm 5.4.3C, we would have been confronted with the situation


    A1    —    D2D2    D3D3D3

after Pass 2, and it would have been impossible to continue with a three-way merge since we cannot merge runs that are in opposite directions! The operation of copying T1 to T2 could be avoided if we rewound T1 and proceeded to read it forward during the next merge phase (while reading T3 and T4 backwards). But it would then be necessary to rewind T1 again after merging, so this trick trades one copy for two rewinds.

Thus the distribution method of Algorithm 5.4.3C does not work as efficiently for read-backwards as for read-forwards; the amount of time required jumps rather sharply every time the number of initial runs passes a "perfect" cascade distribution number. Another dispersion technique can be used to give a smoother transition between perfect cascade distributions (see exercise 17).

Read-backward polyphase. At first glance (and even at second and third glance), the polyphase merge scheme seems to be totally unfit for reading backwards. For example, suppose that we have 13 initial runs and three tapes:

             T1            T2                  T3
    Phase 1  A1A1A1A1A1    A1A1A1A1A1A1A1A1    —
    Phase 2  —             A1A1A1              D2D2D2D2D2

Now we’re stuck; we could rewind either T2 or T3 and then read it forwards, 
while reading the other tape backwards, but this would jumble things up and 
we would have gained comparatively little by reading backwards. 




An ingenious idea that saves the situation is to alternate the direction of 
runs on each tape. Then the merging can proceed in perfect synchronization: 



             T1             T2                  T3
    Phase 1  A1D1A1D1A1     D1A1D1A1D1A1D1A1    —
    Phase 2  —              D1A1D1              D2A2D2A2D2
    Phase 3  A3D3A3         —                   D2A2
    Phase 4  A3             D5A5                —
    Phase 5  —              D5                  D8
    Phase 6  A13            —                   —


This principle was mentioned briefly by R. L. Gilstad in his original article on 
polyphase merging, and he described it more fully in CACM 6 (1963), 220-223. 
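The synchronization can be verified mechanically. In the following minimal Python sketch (ours; run lengths are omitted, only the stored directions are tracked), each tape is a stack of 'A'/'D' marks; a merge pops the top run from each input and writes a run in the opposite direction:

    flip = {'A': 'D', 'D': 'A'}

    def merge_phase(src1, src2, dst):
        # merge, reading backwards, until the shorter input is exhausted
        while src1 and src2:
            x, y = src1.pop(), src2.pop()
            assert x == y, "runs would be in opposite directions!"
            dst.append(flip[x])

    t1, t2, t3 = list("ADADA"), list("DADADADA"), []   # Phase 1 above
    merge_phase(t1, t2, t3)    # Phase 2: t2 = D1A1D1, t3 = D2A2D2A2D2
    merge_phase(t2, t3, t1)    # Phase 3: t1 = A3D3A3, t3 = D2A2
    merge_phase(t3, t1, t2)    # Phase 4: t1 = A3, t2 = D5A5
    merge_phase(t1, t2, t3)    # Phase 5: t2 = D5, t3 = D8
    merge_phase(t2, t3, t1)    # Phase 6: the single final run
    print(t1)                  # ['A'] -- ascending, as in the table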

The ADA... technique works properly for polyphase merging on any number of tapes; for we can show that the A's and D's will be properly synchronized at each phase, provided only that the initial distribution pass produces alternating A's and D's on each tape and that each tape ends with A (or each tape ends with D): Since the last run written on the output file during one phase is in the opposite direction from the last runs used from the input files, the next phase always finds its runs in the proper orientation. Furthermore we have seen in exercise 5.4.2-13 that most of the perfect Fibonacci distributions call for an odd number of runs on one tape (the eventual output tape), and an even number of runs on each other tape. If T1 is designated as the final output tape, we can therefore guarantee that all tapes end with an A run, if we start T1 with an A and let the remaining tapes start with a D. A distribution method analogous to Algorithm 5.4.2D can be used, modified so that the distributions on each level have T1 as the final output tape. (We skip levels 1, T+1, 2T+1, ..., since they are the levels in which the initially empty tape is the final output tape.) For example, in the six-tape case, we can use the following distribution numbers in place of 5.4.2-(1):


    Level    T1     T2     T3     T4     T5    Total    Final output will be on
      0       1      0      0      0      0       1              T1
      2       1      2      2      2      2       9              T1
      3       3      4      4      4      2      17              T1
      4       7      8      8      6      4      33              T1
      5      15     16     14     12      8      65              T1
      6      31     30     28     24     16     129              T1
      8      61    120    116    108     92     497              T1


Thus, T1 always gets an odd number of runs, while T2 through T5 get the even 
numbers, in decreasing order for flexibility in dummy run assignment. Such a 
distribution has the advantage that the final output tape is known in advance, 
regardless of the number of initial runs that happen to be present. It turns out 
(see exercise 3) that the output will always appear in ascending order on T1 
when this scheme is used. 




Another way to handle the distribution for read-backward polyphase has been suggested by D. T. Goodwin and J. L. Venn [CACM 7 (1964), 315]. We can distribute runs almost as in Algorithm 5.4.2D, beginning with a D run on each tape. When the input is exhausted, a dummy A run is imagined to be at the beginning of the unique "odd" tape, unless a distribution with all odd numbers has been reached. Other dummies are imagined at the end of the tapes, or grouped into pairs in the middle. The question of optimum placement of dummy runs is analyzed in exercise 5 below.

Optimum merge patterns. So far we have been discussing various patterns 
for merging on tape, without asking for “best possible” methods. It appears 
to be quite difficult to determine the optimal patterns, especially in the read- 
forward case where the interaction of rewind time with merge time is hard to 
handle. On the other hand, when merging is done by reading backwards and 
writing forwards, all rewinding is essentially eliminated, and it is possible to 
get a fairly good characterization of optimal ways to merge. Richard M. Karp 
has introduced some very interesting approaches to this problem, and we shall 
conclude this section by discussing the theory he has developed. 

In the first place we need a more satisfactory way to describe merging 
patterns, instead of the rather mysterious tape-content tableaux that have been 
used above. Karp has suggested two ways to do this, the vector representation 
and the tree representation of a merge pattern. Both forms of representation are 
useful in practice, so we shall describe them in turn. 

The vector representation of a merge pattern consists of a sequence of "merge vectors" y^(m) ... y^(1) y^(0), each of which has T components. The ith-last merge step is represented by y^(i) in the following way:

    y_j^(i) = +1, if tape number j is an input to the merge;
            =  0, if tape number j is not used in the merge;        (2)
            = -1, if tape number j gets the output of the merge.

Thus, exactly one component of y^(i) is -1, and the other components are 0s and 1s. The final vector y^(0) is special; it is a unit vector, having 1 in position j if the final sorted output appears on unit j, and 0 elsewhere. These definitions imply that the vector sum

    v^(i) = y^(i) + y^(i-1) + ... + y^(0)        (3)

represents the distribution of runs on tape just before the ith-last merge step, with v_j^(i) runs on tape j. In particular, v^(m) tells how many runs the initial distribution pass places on each tape.

It may seem awkward to number these vectors backwards, with y^(m) coming first and y^(0) last, but this peculiar viewpoint turns out to be advantageous for developing the theory. One good way to search for an optimal method is to start with the sorted output and to imagine "unmerging" it to various tapes, then unmerging these, etc., considering the successive distributions v^(0), v^(1), v^(2), ... in the reverse order from which they actually occur during the sorting process.


5.4.4 


READING TAPE BACKWARDS 303 


In fact that is essentially the approach we have taken already in our analysis of polyphase and cascade merging.

The three merge patterns described in tabular form earlier in this section have the following vector representations:


Balanced (T = 4, S = 8):

    v^(7) = ( 4,  4,  0,  0)
    y^(7) = (+1, +1, -1,  0)
    y^(6) = (+1, +1,  0, -1)
    y^(5) = (+1, +1, -1,  0)
    y^(4) = (+1, +1,  0, -1)
    y^(3) = (-1,  0, +1, +1)
    y^(2) = ( 0, -1, +1, +1)
    y^(1) = (+1, +1, -1,  0)
    y^(0) = ( 0,  0,  1,  0)

Cascade (T = 4, S = 14):

    v^(10) = ( 6,  5,  3,  0)
    y^(10) = (+1, +1, +1, -1)
    y^(9)  = (+1, +1, +1, -1)
    y^(8)  = (+1, +1, +1, -1)
    y^(7)  = (+1, +1, -1,  0)
    y^(6)  = (+1, +1, -1,  0)
    y^(5)  = (+1, -1,  0,  0)
    y^(4)  = (-1, +1, +1, +1)
    y^(3)  = ( 0, -1, +1, +1)
    y^(2)  = ( 0,  0, -1, +1)
    y^(1)  = (+1, +1, +1, -1)
    y^(0)  = ( 0,  0,  0,  1)

Polyphase (T = 3, S = 13):

    v^(12) = ( 5,  8,  0)
    y^(12) = (+1, +1, -1)
    y^(11) = (+1, +1, -1)
    y^(10) = (+1, +1, -1)
    y^(9)  = (+1, +1, -1)
    y^(8)  = (+1, +1, -1)
    y^(7)  = (-1, +1, +1)
    y^(6)  = (-1, +1, +1)
    y^(5)  = (-1, +1, +1)
    y^(4)  = (+1, -1, +1)
    y^(3)  = (+1, -1, +1)
    y^(2)  = (+1, +1, -1)
    y^(1)  = (-1, +1, +1)
    y^(0)  = ( 1,  0,  0)


Every merge pattern obviously has a vector representation. Conversely, it is easy to see that the sequence of vectors y^(m) ... y^(1) y^(0) corresponds to an actual merge pattern if and only if the following three conditions are satisfied:

i) y^(0) is a unit vector.

ii) y^(i) has exactly one component equal to -1, all other components equal to 0 or +1, for m ≥ i ≥ 1.

iii) All components of y^(i) + ... + y^(1) + y^(0) are nonnegative, for m ≥ i ≥ 1.
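These three conditions translate directly into a small validity checker; a sketch (ours; the function name is made up):

    def is_valid_merge_pattern(ys):
        # ys lists y^(m), ..., y^(1), y^(0), in that order.
        y0 = ys[-1]
        if sorted(y0) != [0] * (len(y0) - 1) + [1]:
            return False                              # (i) fails
        for y in ys[:-1]:
            if list(y).count(-1) != 1 or any(c not in (-1, 0, 1) for c in y):
                return False                          # (ii) fails
        total = [0] * len(y0)
        for y in reversed(ys):
            total = [a + b for a, b in zip(total, y)]
            if any(c < 0 for c in total):
                return False                          # (iii): some v^(i) < 0
        return True

    # The polyphase pattern for S = 13 on three tapes, from the table above:
    polyphase = ([(+1, +1, -1)] * 5 + [(-1, +1, +1)] * 3 + [(+1, -1, +1)] * 2
                 + [(+1, +1, -1), (-1, +1, +1), (1, 0, 0)])
    print(is_valid_merge_pattern(polyphase))          # True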

The tree representation of a merge pattern gives another picture of the same 
information. We construct a tree with one external leaf node for each initial 
run, and one internal node for each run that is merged, in such a way that the 
descendants of each internal node are the runs from which it was fabricated. 
Each internal node is labeled with the step number on which the corresponding 
run was formed, numbering steps backwards as in the vector representation; 
furthermore, the line just above each node is labeled with the name of the tape 
on which that run appears. For example, the three merge patterns above have 
the tree representations depicted in Fig. 76, if we call the tapes A, B, C, D 
instead of T1, T2, T3, T4.

This representation displays many of the relevant properties of the merge 
pattern in convenient form; for example, if the run on level 0 of the tree (the 
root) is to be ascending, then the runs on level 1 must be descending, those 
on level 2 must be ascending, etc.; an initial run is ascending if and only if the 
corresponding external node is on an even-numbered level. Furthermore the total 
number of initial runs processed during the merging (not including the initial 
distribution) is exactly equal to the external path length of the tree, since each 
initial run on level k is processed exactly k times. 





Fig. 76. Tree representations of three merge patterns. 


Every merge pattern has a tree representation, but not every tree defines a merge pattern. A tree whose internal nodes have been labeled with the numbers 1 through m, and whose lines have been labeled with tape names, represents a valid read-backward merge pattern if and only if

a) no two lines adjacent to the same internal node have the same tape name;

b) if i > j, and if A is a tape name, the tree does not contain the configuration

    [diagram]

c) if i < j < k < l, and if A is a tape name, the tree does not contain both of the configurations

    [diagram]    and    [diagram]        (4)




Condition (a) is self-evident, since the input and output tapes in a merge must be 
distinct; similarly, (b) is obvious. The “no crossover” condition (c) mirrors the 
last-in-first-out restriction that characterizes read-backward operations on tape: 
The run formed at step k must be removed before any runs formed previously on 
that same tape; hence the configurations in (4) are impossible. It is not difficult 
to verify that any labeled tree satisfying conditions (a), (b), (c) does indeed 
correspond to a read-backward merge pattern. 

If there are T tape units, condition (a) implies that the degree of each 
internal node is T — 1 or less. It is not always possible to attach suitable labels 
to all such trees; for example, when T — 3 there is no merge pattern whose tree 
has the shape

    [diagram]        (5)



This shape would lead to an optimal merge pattern if we could attach step 
numbers and tape names in a suitable way, since it is the only way to achieve 
the minimum external path length in a tree having four external nodes. But 
there is essentially only one way to do the labeling according to conditions (a) 
and (b), because of the symmetries of the diagram, namely,

    [diagram]

and this violates condition (c). A shape that can be labeled according to the
conditions above, using at most T tape names, is called a T-lifo tree. 

Another way to characterize all labeled trees that can arise from merge patterns is to consider how all such trees can be "grown." Start with some tape name, say A, and with the seedling

    [diagram]

Step number i in the tree's growth consists of choosing distinct tape names B, B_1, B_2, ..., B_k, and changing the most recently formed external node corresponding to B into the configuration

    [diagram]        (7)


This “last formed, first grown on” rule explains how the tree representation can 
be constructed directly from the vector representation. 

The determination of strictly optimum T-tape merge patterns, that is, of T-lifo trees whose path length is minimum for a given number of external nodes, seems to be quite difficult. For example, the following nonobvious pattern turns out to be an optimum way to merge seven initial runs on four tapes, reading backwards:

    [diagram]        (8)

A one-way merge is actually necessary to achieve the optimum! (See exercise 8.)
On the other hand, it is not so difficult to give constructions that are asymptot- 
ically optimal, for any fixed T. 

Let K_T(n) be the minimum external path length achievable in a T-lifo tree with n external nodes. From the theory developed in Section 2.3.4.5, it is not difficult to prove that

    K_T(n) ≥ nq - ⌊((T-1)^q - n)/(T-2)⌋,   q = ⌈log_{T-1} n⌉,        (9)

since this is the minimum external path length of any tree with n external nodes and all nodes of degree < T. At the present time comparatively few values of K_T(n) are known exactly. Here are some upper bounds that are probably exact:

    n =       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    K_3(n) ≤  0   2   5   9  12  16  21  25  30  34  39  45  50  56  61        (10)
    K_4(n) ≤  0   2   3   6   8  11  14  17  20  24  27  31  33  37  40


Karp discovered that any tree whose internal nodes have degrees < T is 
almost T-lifo, in the sense that it can be made T-lifo by changing some of the 
external nodes to one-way merges. In fact, the construction of a suitable labeling 
is fairly simple. Let A be a particular tape name, and proceed as follows: 




Step 1. Attach tape names to the lines of the tree diagram, in any manner 
consistent with condition (a) above, provided that the special name A is used 
only in the leftmost line of a branch. 

Step 2. Replace each external node of the form

    [diagram]

by

    [diagram]

whenever B ≠ A.

Step 3. Number the internal nodes of the tree in preorder. The result will be a 
labeling satisfying conditions (a), (b), and (c). 

For example, if we start with the tree

    [diagram]        (11)

and three tapes, this procedure might assign labels as follows:

    [diagram]        (12)


It is not difficult to verify that Karp’s construction satisfies the “last formed, 
first grown on” discipline, because of the nature of preorder (see exercise 12). 

The result of this construction is a merge pattern for which all of the initial 
runs appear on tape A. This suggests the following distribution and sorting 
scheme, which we may call the preorder merge: 






P1. Distribute initial runs onto tape A until the input is exhausted. Let S be the total number of initial runs.

P2. Carry out the construction above, using a minimum-path-length (T-1)-ary tree with S external nodes, obtaining a T-lifo tree whose external path length is within S of the lower bound in (9).

P3. Merge the runs according to this pattern.

This scheme will produce its output on any desired tape. But it has one serious flaw: does the reader see what will go wrong? The problem is that the merge pattern requires some of the runs initially on tape A to be ascending, and some to be descending, depending on whether the corresponding external node appears on an odd or an even level. This problem can be resolved without knowing S in advance, by copying runs that should be descending onto an auxiliary tape or tapes, just before they are needed. Then the total amount of processing, in terms of initial run lengths, comes to

    S log_{T-1} S + O(S).        (13)

Thus the preorder merge is definitely better than polyphase or cascade, as S → ∞; indeed, it is asymptotically optimum, since (9) shows that S log_{T-1} S + O(S) is the best we could ever hope to achieve on T tapes. On the other hand, for the comparatively small values of S that usually arise in practice, the preorder merge is rather inefficient; polyphase or cascade methods are simpler and faster, when S is reasonably small. Perhaps it will be possible to invent a simple distribution-and-merge scheme that is competitive with polyphase and cascade for small S, and that is asymptotically optimum for large S.

The second set of exercises below shows how Karp has formulated the 
question of read-forward merging in a similar way. The theory turns out to 
be rather more complicated in this case, although some very interesting results 
have been discovered. 

EXERCISES — First Set 

1. [17] It is often convenient, during read-forward merging, to mark the end of each run on tape by including an artificial sentinel record whose key is +∞. How should this practice be modified, when reading backwards?

2. [20] Will the columns of an array like (1) always be nondecreasing, or is there a chance that we will have to "subtract" runs from some tape as we go from one level to the next?

► 3. [20] Prove that when read-backward polyphase merging is used with the perfect distributions of (1), we will always obtain an A run on tape T1 when sorting is complete, if T1 originally starts with ADA... and T2 through T5 start with DAD....

4. [M22] Is it a good idea to do read-backward polyphase merging after distributing 
all runs in ascending order, imagining all the D positions to be initially filled with 
dummies? 

► 5. [23] What formulas for the strings of merge numbers replace (8), (9), (10), and (11) of Section 5.4.2, when read-backward polyphase merging is used? Show the merge numbers for the fifth level distribution on six tapes, by drawing a diagram like Fig. 71(a).

6. [07] What is the vector representation of the merge pattern whose tree representation is (8)?

7. [16] Draw the tree representation for the read-backward merge pattern defined by the following sequence of vectors:

    v^(33) = (20,  9,  5)
    y^(33) = (+1, -1, +1)      y^(16) = (+1, +1, -1)
    y^(32) = (+1, +1, -1)      y^(15) = (+1, +1, -1)
    y^(31) = (+1, +1, -1)      y^(14) = (+1, -1, +1)
    y^(30) = (+1, +1, -1)      y^(13) = (+1, -1, +1)
    y^(29) = (+1, -1, +1)      y^(12) = (-1, +1, +1)
    y^(28) = (-1, +1, +1)      y^(11) = (+1, +1, -1)
    y^(27) = (+1, -1, +1)      y^(10) = (+1, +1, -1)
    y^(26) = (+1, -1, +1)      y^(9)  = (+1, -1, +1)
    y^(25) = (+1, +1, -1)      y^(8)  = (+1, +1, -1)
    y^(24) = (+1, -1, +1)      y^(7)  = (+1, +1, -1)
    y^(23) = (+1, -1, +1)      y^(6)  = (+1, +1, -1)
    y^(22) = (+1, -1, +1)      y^(5)  = (-1, +1, +1)
    y^(21) = (-1, +1, +1)      y^(4)  = (+1, -1, +1)
    y^(20) = (+1, +1, -1)      y^(3)  = (-1, +1, +1)
    y^(19) = (-1, +1, +1)      y^(2)  = (+1, -1, +1)
    y^(18) = (+1, +1, -1)      y^(1)  = (-1, +1, +1)
    y^(17) = (+1, +1, -1)      y^(0)  = ( 1,  0,  0)

8. [23] Prove that (8) is an optimum way to merge, reading backwards, when S = 7 
and T = 4, and that all methods that avoid one-way merging are inferior. 

9. [M22] Prove the lower bound (9).

10. [41] Prepare a table of the exact values of K_T(n), using a computer.

► 11. [20] True or false: Any read-backward merge pattern that uses nothing but (T-1)-way merging must always have the runs alternating ADAD... on each tape; it will not work if two adjacent runs appear in the same order.

12. [22] Prove that Karp's preorder construction always yields a labeled tree satisfying conditions (a), (b), and (c).

13. [16] Make (12) more efficient, by removing as many of the one-way merges as possible so that preorder still gives a valid labeling of the internal nodes.

14. [40] Devise an algorithm that carries out the preorder merge without explicitly representing the tree in steps P2 and P3, using only O(log S) words of memory to control the merging pattern.

15. [M39] Karp's preorder construction in the text yields trees with one-way merges at several terminal nodes. Prove that when T = 3 it is possible to construct asymptotically optimal 3-lifo trees in which two-way merging is used throughout.

In other words, let K̂_T(n) be the minimum external path length over all T-lifo trees with n external nodes, such that every internal node has degree T - 1. Prove that K̂_3(n) = n lg n + O(n).

16. [M46] In the notation of exercise 15, is K̂_T(n) = n log_{T-1} n + O(n) for all T > 3, when n ≡ 1 (modulo T - 2)?







► 17. [28] (Richard D. Pratt.) To achieve ascending order in a read-backward cascade merge, we could insist on an even number of merging passes; this suggests a technique of initial distribution that is somewhat different from Algorithm 5.4.3C.

a) Change 5.4.3-(1) so that it shows only the perfect distributions that require an even number of merging passes.

b) Design an initial distribution scheme that interpolates between these perfect distributions. (Thus, if the number of initial runs falls between perfect distributions, it is desirable to merge some, but not all, of the runs twice, in order to reach a perfect distribution.)

► 18. [M38] Suppose that T tape units are available, for some T ≥ 3, and that T1 contains N records while the remaining tapes are empty. Is it possible to reverse the order of the records on T1 in fewer than O(N log N) steps, without reading backwards? (The operation is, of course, trivial if backwards reading is allowed.) See exercise 5.2.5-14 for a class of such algorithms that do require order N log N steps.

EXERCISES — Second Set 

The following exercises develop the theory of tape merging on read-forward tapes; in this case each tape acts as a queue instead of as a stack. A merge pattern can be represented as a sequence of vectors y^(m) ... y^(0) exactly as in the text, but when we convert the vector representation to a tree representation we change "last formed, first grown on" to "first formed, first grown on." Thus the invalid configurations (4) would be changed to

    [diagram]        (4')

A tree that can be labeled so as to represent a read-forward merge on T tapes is called T-fifo, analogous to the term "T-lifo" in the read-backward case.

When tapes can be read backwards, they make very good stacks. But unfortunately they don't make very good general-purpose queues. If we randomly write and read, in a first-in-first-out manner, we waste a lot of time moving from one part of the tape to another. Even worse, we will soon run off the end of the tape! We run into the same problem as the queue overrunning memory in 2.2.2-(4) and (5), but the solution in 2.2.2-(6) and (7) doesn't apply to tapes since they aren't circular loops. Therefore we shall call a tree strongly T-fifo if it can be labeled so that the corresponding merge pattern makes each tape follow the special queue discipline "write, rewind, read all, rewind; write, rewind, read all, rewind; etc."

► 19. [22] (R. M. Karp.) Find a binary tree that is not 3-fifo.

► 20. [22] Formulate the condition "strongly T-fifo" in terms of a fairly simple rule about invalid configurations of tape labels, analogous to (4').

21. [18] Draw the tree representation for the read-forward merge pattern defined by the vectors in exercise 7. Is this tree strongly 3-fifo?

22. [28] (R. M. Karp.) Show that the tree representations for polyphase and cascade 
merging with perfect distributions are exactly the same for both the read-backward 
and the read-forward case, except for the numbers that label the internal nodes. Find 
a larger class of vector representations of merging patterns for which this is true. 




23. [24] (R. M. Karp.) Let us say that a segment y^(q) ... y^(r) of a merge pattern is a stage if no output tape is subsequently used as an input tape; that is, if there do not exist i, j, k with q ≥ i > k ≥ r, y_j^(i) = -1, and y_j^(k) = +1. The purpose of this exercise is to prove that cascade merge minimizes the number of stages, over all merge patterns having the same number of tapes and initial runs.

It is convenient to define some notation. Let us write v → w if v and w are T-vectors such that w reduces to v in the first stage of some merge pattern. (Thus there is a merge pattern y^(m) ... y^(0) such that y^(m) ... y^(r) is a stage, w = y^(m) + ... + y^(0), and v = y^(r-1) + ... + y^(0).) Let us write v ≼ w if v and w are T-vectors such that the sum of the largest k elements of v is ≤ the sum of the largest k elements of w, for 1 ≤ k ≤ T. Thus, for example, (2,1,2,2,2,1) ≼ (1,2,3,0,3,1), since 2 ≤ 3, 2+2 ≤ 3+3, ..., 2+2+2+2+1+1 ≤ 3+3+2+1+1+0. Finally, if v = (v_1, ..., v_T), let C(v) = (s_T, s_{T-2}, s_{T-3}, ..., s_1, 0), where s_k is the sum of the largest k elements of v.

a) Prove that v → C(v).

b) Prove that v ≼ w implies C(v) ≼ C(w).

c) Assuming the result of exercise 24, prove that cascade merge minimizes the number of stages.

24. [M35] In the notation of exercise 23, prove that v → w implies w ≼ C(v).

25. [M36] (R. M. Karp.) Let us say that a segment y^(q) ... y^(r) of a merge pattern is a phase if no tape is used both for input and for output; that is, if there do not exist i, j, k with q ≥ i ≥ r, q ≥ k ≥ r, y_j^(i) = +1, and y_j^(k) = -1. The purpose of this exercise is to investigate merge patterns that minimize the number of phases. We shall write v ⇒ w if w can be reduced to v in one phase (a similar notation was introduced in exercise 23); and we let

    D_k(v) = (s_k + t_{k+1}, s_k + t_{k+2}, ..., s_k + t_T, 0, ..., 0),

where t_j denotes the jth largest element of v and s_k = t_1 + ... + t_k.

a) Prove that v ⇒ D_k(v) for 1 ≤ k < T.

b) Prove that v ≼ w implies D_k(v) ≼ D_k(w), for 1 ≤ k < T.

c) Prove that v ⇒ w implies w ≼ D_k(v), for some k, 1 ≤ k < T.

d) Consequently, a merge pattern that sorts the maximum number of initial runs on T tapes in q phases can be represented by a sequence of integers k_1 k_2 ... k_q, such that the initial distribution is D_{k_q}(...(D_{k_2}(D_{k_1}(u)))...), where u = (1, 0, ..., 0). This minimum-phase strategy has a strongly T-fifo representation, and it also belongs to the class of patterns in exercise 22. When T = 3 it is the polyphase merge, and for T = 4, 5, 6, 7 it is a variation of the balanced merge.

26. [M46] (R. M. Karp.) Is the optimum sequence k_1 k_2 ... k_q mentioned in exercise 25 equal to 1 ⌈T/2⌉ ⌊T/2⌋ ⌈T/2⌉ ⌊T/2⌋ ..., for all T ≥ 4 and all sufficiently large q?
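The operator D_k of exercise 25 is concrete enough to experiment with. A minimal Python sketch (ours, not part of the exercises); with T = 3 and k = 1 at every phase it reproduces the perfect polyphase (Fibonacci) distributions:

    def D(k, v):
        # the operator D_k of exercise 25
        t = sorted(v, reverse=True)    # t[j-1] is the jth largest element
        s = sum(t[:k])                 # s_k
        return tuple(s + tj for tj in t[k:]) + (0,) * k

    v = (1, 0, 0)                      # u = (1, 0, ..., 0)
    for _ in range(5):
        v = D(1, v)
        print(v)   # (1, 1, 0), (2, 1, 0), (3, 2, 0), (5, 3, 0), (8, 5, 0)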

*5.4.5. The Oscillating Sort 

A somewhat different approach to merge sorting was introduced by Sheldon 
Sobel in JACM 9 (1962), 372-375. Instead of starting with a distribution pass 
where all the initial runs are dispersed to tapes, he proposed an algorithm that 
oscillates back and forth between distribution and merging, so that much of the 
sorting takes place before the input has been completely examined. 





Suppose, for example, that there are five tapes available for merging. Sobel’s 
method would sort 16 initial runs as follows: 



                          T1      T2      T3      T4      T5      Cost
    Phase 1  Distribute   A1      A1      A1      A1      —        4
    Phase 2  Merge        —       —       —       —       D4       4
    Phase 3  Distribute   —       A1      A1      A1      D4A1     4
    Phase 4  Merge        D4      —       —       —       D4       4
    Phase 5  Distribute   D4A1    —       A1      A1      D4A1     4
    Phase 6  Merge        D4      D4      —       —       D4       4
    Phase 7  Distribute   D4A1    D4A1    —       A1      D4A1     4
    Phase 8  Merge        D4      D4      D4      —       D4       4
    Phase 9  Merge        —       —       —       A16     —       16


Here, as in Section 5.4.4, we use A_r and D_r to stand respectively for ascending and descending runs of relative length r. The method begins by writing an initial run onto each of four tapes, and merges them (reading backwards) onto the fifth tape. Distribution resumes again, this time cyclically shifted one place to the right with respect to the tapes, and a second merge produces another run D_4. When four D_4's have been formed in this way, an additional merge creates A_16. We could go on to create three more A_16's, merging them into a D_64, and so on until the input is exhausted. It isn't necessary to know the length of the input in advance.

When the number of initial runs, S, is 4^m, it is not difficult to see that this
method processes each record exactly m + 1 times: once during the distribution,
and m times during a merge. When S is between 4^{m−1} and 4^m, we could assume
that dummy runs are present, bringing S up to 4^m; hence the total sorting time
would essentially amount to ⌈log_4 S⌉ + 1 passes over all the data. This is just
what would be achieved by a balanced sort on eight tapes; in general, oscillating
sort with T work tapes is equivalent to balanced merging with 2(T−1) tapes,
since it makes

    ⌈log_{T−1} S⌉ + 1

passes over the data. When S is a power of T − 1, this is the best any T-tape
method could possibly do, since it achieves the lower bound in Eq. 5.4.4–(9). On
the other hand, when S is

    (T − 1)^{m−1} + 1,

just one higher than a power of T − 1, the method wastes nearly a whole pass.
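The pass count just derived is simple enough to tabulate by machine. Here is a small sketch of ours that computes it with exact integer arithmetic:

    def oscillating_passes(S, T):
        """One distribution pass plus ceil(log_{T-1} S) merge passes."""
        m, power = 1, T - 1
        while power < S:                 # smallest m with (T-1)^m >= S
            m, power = m + 1, power * (T - 1)
        return m + 1

    print(oscillating_passes(16, 5))     # 3, as in the sixteen-run example
    print(oscillating_passes(17, 5))     # 4: one extra run costs a whole pass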

Exercise 2 shows how to eliminate part of this penalty for non-perfect- 
powers S, by using a special ending routine. A further refinement was discovered 
in 1966 by Dennis L. Bencher, who called his procedure the “criss-cross merge” 
[see H. Wedekind, Datenorganisation (Berlin: W. de Gruyter, 1970), 164-166; 
see also U.S. Patent 3540000 (1970)]. The main idea is to delay merging until 
more knowledge of S has been gained. We shall discuss a slightly modified form 
of Bencher’s original scheme. 




This improved oscillating sort proceeds as follows: 



             Operation     T1       T2       T3       T4       T5       Cost
Phase 1      Distribute    —        A1       A1       A1       A1        4
Phase 2      Distribute    —        A1       A1 A1    A1 A1    A1 A1     3
Phase 3      Merge         D4       —        A1       A1       A1        4
Phase 4      Distribute    D4 A1    —        A1       A1 A1    A1 A1     3
Phase 5      Merge         D4       D4       —        A1       A1        4
Phase 6      Distribute    D4 A1    D4 A1    —        A1       A1 A1     3
Phase 7      Merge         D4       D4       D4       —        A1        4
Phase 8      Distribute    D4 A1    D4 A1    D4 A1    —        A1        3
Phase 9      Merge         D4       D4       D4       D4       —         4

We do not merge the D4's into an A16 at this point (unless the input happens
to be exhausted); only after building up to

Phase 15     Merge         D4 D4    D4 D4    D4 D4    D4       —         4

will we get

Phase 16     Merge         D4       D4       D4       —        A16      16

The second A16 will occur after three more D4's have been made,

Phase 22     Merge         D4 D4    D4 D4    D4       —        A16 D4    4
Phase 23     Merge         D4       D4       —        A16      A16      16


and so on (compare with Phases 1–5). The advantage of Bencher's scheme can be
seen for example if there are only five initial runs: Oscillating sort as modified
in exercise 2 would do a four-way merge (in Phase 2) followed by a two-way
merge, for a total cost of 4 + 4 + 1 + 5 = 14, while Bencher's scheme would do
a two-way merge (in Phase 3) followed by a four-way merge, for a total cost of
4 + 1 + 2 + 5 = 12. Both methods also involve a small additional cost, namely
one unit of rewind before the final merge.

A precise description of Bencher’s method appears in Algorithm B below. 
Unfortunately it seems to be a procedure that is harder to understand than to 
code; it is much easier to explain the technique to a computer than to a computer 
scientist! This is partly because it is an inherently recursive method that has 
been expressed in iterative form and then optimized somewhat; the reader may 
find it necessary to trace through the operation of this algorithm several times 
before discovering what is really going on. 

Algorithm B (Oscillating sort with "criss-cross" distribution). This algorithm
takes initial runs and disperses them to tapes, occasionally interrupting the
distribution process in order to merge some of the tape contents. The algorithm
uses P-way merging, assuming that T = P + 1 ≥ 3 tape units are available —
not counting the unit that may be necessary to hold the input data. The tape
units must allow reading in both forward and backward directions, and they are
designated by the numbers 0, 1, …, P. The following tables are maintained:





D[j], 0 ≤ j ≤ P:    Number of dummy runs assumed to be present at the end of
                    tape j.

A[l,j], 0 ≤ l ≤ L,  Here L is a number such that at most P^{L+1} initial runs will
        0 ≤ j ≤ P:  be input. When A[l,j] = k ≥ 0, a run of nominal length
                    P^k is present on tape j, corresponding to "level l" of the
                    algorithm's operation. This run is ascending if k is even,
                    descending if k is odd. When A[l,j] < 0, level l does not use
                    tape j.

The statement "Write an initial run on tape j" is an abbreviation for the
following operations:

    Set A[l,j] ← 0. If the input is exhausted, increase D[j] by 1; otherwise
    write an initial run (in ascending order) onto tape j.

The statement "Merge to tape j" is an abbreviation for the following operations:

    If D[i] > 0 for all i ≠ j, decrease D[i] by 1 for all i ≠ j and increase D[j]
    by 1. Otherwise merge one run to tape j, from all tapes i ≠ j such that
    D[i] = 0, and decrease D[i] by 1 for all other i ≠ j.



Fig. 77. Oscillating sort, with a “criss-cross” distribution. 


B1. [Initialize.] Set D[j] ← 0 for 0 ≤ j ≤ P. Set A[0,0] ← −1, l ← 0, q ← 0.
Then write an initial run on tape j, for 1 ≤ j ≤ P.

B2. [Input complete?] (At this point tape q is empty and the other tapes contain
at most one run each.) If there is more input, go on to step B3. But if
the input is exhausted, rewind all tapes j ≠ q such that A[0,j] is even;
then merge to tape q, reading forwards on the tapes just rewound, and reading
backwards on the other tapes. This completes the sort, with the output in
ascending order on tape q.

B3. [Begin new level.] Set l ← l + 1, r ← q, s ← 0, and q ← (q + 1) mod T.
Write an initial run on tape (q + j) mod T, for 1 ≤ j ≤ T − 2. (Thus an
initial run is written onto each tape except tapes q and r.) Set A[l,q] ← −1
and A[l,r] ← −2.

B4. [Ready to merge?] If A[l−1, q] ≠ s, go back to step B3.










B5. [Merge.] (At this point A[l−1, q] = A[l,j] = s for all j ≠ q, j ≠ r.)
Merge to tape r, reading backwards. (See the definition of this operation
above.) Then set s ← s + 1, l ← l − 1, A[l,r] ← s, and A[l,q] ← −1. Set
r ← (2q − r) mod T. (In general, we have r = (q − 1) mod T when s is even,
r = (q + 1) mod T when s is odd.)

B6. [Is level complete?] If l = 0, go to B2. Otherwise if A[l,j] = s for all j ≠ q
and j ≠ r, go to B4. Otherwise return to B3. |
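Readers who want to watch the criss-cross pattern unfold can do so with the following simulation. It is a Python sketch of our own, not Knuth's program: runs are represented only by their lengths, tapes become stacks, and run directions, rewinding, and real I/O are abstracted away.

    def oscillating_sort(run_lengths, P):
        """Simulate Algorithm B on T = P + 1 >= 3 tapes.  A run is just a
        length; merging replaces runs by their sum.  Assumes >= 1 run."""
        T = P + 1
        runs = list(run_lengths)          # initial runs not yet read
        tape = [[] for _ in range(T)]     # each tape is a stack of run lengths
        D = [0] * T                       # D[j]: dummy runs on tape j
        A = {}                            # the A[l,j] table

        def write_initial(l, j):          # "Write an initial run on tape j"
            A[l, j] = 0
            if runs: tape[j].append(runs.pop(0))
            else: D[j] += 1

        def merge_to(j):                  # "Merge to tape j"
            others = [i for i in range(T) if i != j]
            if all(D[i] > 0 for i in others):
                for i in others: D[i] -= 1
                D[j] += 1
            else:
                total = 0
                for i in others:
                    if D[i] == 0: total += tape[i].pop()
                    else: D[i] -= 1
                tape[j].append(total)

        A[0, 0] = -1; l = q = 0                        # step B1
        for j in range(1, P + 1): write_initial(0, j)
        step = 'B2'
        while True:
            if step == 'B2':                           # B2: input complete?
                if not runs:
                    merge_to(q)                        # final merge (rewinds elided)
                    return tape[q].pop()
                step = 'B3'
            elif step == 'B3':                         # B3: begin new level
                l += 1; r = q; s = 0; q = (q + 1) % T
                for j in range(1, T - 1):
                    write_initial(l, (q + j) % T)
                A[l, q] = -1; A[l, r] = -2
                step = 'B4'
            elif step == 'B4':                         # B4: ready to merge?
                step = 'B5' if A.get((l - 1, q)) == s else 'B3'
            else:                                      # B5: merge; B6: test level
                merge_to(r)
                s += 1; l -= 1
                A[l, r] = s; A[l, q] = -1
                r = (2 * q - r) % T
                if l == 0: step = 'B2'
                elif all(A.get((l, j)) == s for j in range(T)
                         if j != q and j != r): step = 'B4'
                else: step = 'B3'

    print(oscillating_sort([1] * 16, P=4))   # 16: reproduces Phases 1-9 above

Tracing this sketch with sixteen unit runs and P = 4 reproduces exactly the Phase 1–9 states shown in the table above, followed by the final merge of step B2.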

We can use a "recursion induction" style of proof to show that this algorithm
is valid, just as we have done for Algorithm 2.3.1T. Suppose that
we begin at step B3 with l = l₀, q = q₀, s₊ = A[l₀, (q₀+1) mod T], and
s₋ = A[l₀, (q₀−1) mod T]; and assume furthermore that either s₊ = 0 or s₋ = 1
or s₊ = 2 or s₋ = 3 or ⋯. It is possible to verify by induction that the algorithm
will eventually get to step B5 without changing rows 0 through l₀ of A, and with
l = l₀ + 1, q = q₀ ± 1, r = q₀, and s = s₊ or s₋, where we choose the + sign if
s₊ = 0 or (s₊ = 2 and s₋ ≠ 1) or (s₊ = 4 and s₋ ≠ 1, 3) or ⋯, and we choose
the − sign if (s₋ = 1 and s₊ ≠ 0) or (s₋ = 3 and s₊ ≠ 0, 2) or ⋯. The proof
sketched here is not very elegant, but the algorithm has been stated in a form
more suited to implementation than to verification.

Figure 78 shows the efficiency of Algorithm B, in terms of the average num- 
ber of times each record is merged as a function of the number S of initial runs, 
assuming that the initial runs are approximately equal in length. (Corresponding 
graphs for polyphase and cascade sort have appeared in Figs. 70 and 74.) A slight 
improvement, mentioned in exercise 3, has been used in preparing this chart. 

A related method called the gyrating sort was developed by R. M. Karp, 
based on the theory of preorder merging that we have discussed in Section 5.4.4; 
see Combinatorial Algorithms, edited by Randall Rustin (Algorithmics Press, 
1972), 21-29. 

Reading forwards. The oscillating sort pattern appears to require a read- 
backwards capability, since we need to store long runs somewhere as we merge 
newly input short runs. However, M. A. Goetz [Proc. AFIPS Spring Joint 
Comp. Conf. 25 (1964), 599-607] has discovered a way to perform an oscillating 
sort using only forward reading and simple rewinding. His method is radically 
different from the other schemes we have seen in this chapter, in two ways: 

a) Data is sometimes written at the front of the tape, with the understanding 
that the existing data in the middle of the tape is not destroyed. 

b) All initial runs have a fixed maximum length. 

Condition (a) violates the first-in-first-out property we have assumed to be 
characteristic of forward reading, but it can be implemented reliably if a sufficient 
amount of blank tape is left between runs and if parity errors are ignored at 
appropriate times. Condition (b) tends to be somewhat incompatible with an 
efficient use of replacement selection. 

Goetz’s read- forward oscillating sort has the somewhat dubious distinction 
of being one of the first algorithms to be patented as an algorithm instead of as a 




physical device [U.S. Patent 3380029 (1968)]; between 1968 and 1988, no one in
the U.S.A. could legally use the algorithm in a program without permission of the
patentee. Bencher’s read-backward oscillating sort technique was patented by 
IBM several years later. [Alas, we have reached the end of the era when the joy of 
discovering a new algorithm was satisfaction enough! Fortunately the oscillating 
sort isn’t especially good; let’s hope that community-minded folks who invent 
the best algorithms continue to make their ideas freely available. Of course the 
specter of people keeping new techniques completely secret is far worse than the 
public appearance of algorithms that are proprietary for a limited time.] 

The central idea in Goetz's method is to arrange things so that each tape
begins with a run of relative length 1, followed by one of relative length P, then
P², etc. For example, when T = 5 the sort begins as follows, using "·" to
indicate the current position of the read-write head on each tape:




           Operation     T1          T2         T3         T4         T5          "Cost"  Remarks
Phase 1    Distribute    ·A1         ·A1        ·A1        ·A1        A1·            5    [T5 not rewound]
Phase 2    Merge         A1·         A1·        A1·        A1·        A1 A4·         4    [Now rewind all]
Phase 3    Distribute    ·A1         ·A1        ·A1        A1·        ·A1 A4         4    [T4 not rewound]
Phase 4    Merge         A1·         A1·        A1·        A1 A4·     A1·A4          4    [Now rewind all]
Phase 5    Distribute    ·A1         ·A1        A1·        ·A1 A4     ·A1 A4         4    [T3 not rewound]
Phase 6    Merge         A1·         A1·        A1 A4·     A1·A4      A1·A4          4    [Now rewind all]
Phase 7    Distribute    ·A1         A1·        ·A1 A4     ·A1 A4     ·A1 A4         4    [T2 not rewound]
Phase 8    Merge         A1·         A1 A4·     A1·A4      A1·A4      A1·A4          4    [Now rewind all]
Phase 9    Distribute    A1·         ·A1 A4     ·A1 A4     ·A1 A4     ·A1 A4         4    [T1 not rewound]
Phase 10   Merge         A1 A4·      A1·A4      A1·A4      A1·A4      A1·A4          4    [No rewinding]
Phase 11   Merge         A1 A4 A16·  A1 A4·     A1 A4·     A1 A4·     A1 A4·        16    [Now rewind all]

And so on. During Phase 1, T1 was rewinding while T2 was receiving its input,


then T2 was rewinding while T3 was receiving input, etc. Eventually, when the 
input is exhausted, dummy runs will start to appear, and we will sometimes 
need to imagine that they were written explicitly on the tape at full length. For 
example, if S = 18, the A1's on T4 and T5 would be dummies during Phase 9;
we would have to skip forwards on T4 and T5 while merging from T2 and T3
to T1 during Phase 10, because we have to get to the A4's on T4 and T5 in
preparation for Phase 11. On the other hand, the dummy A1 on T1 need not
appear explicitly. Thus the "endgame" is a bit tricky.

Another example of this method appears in the next section. 

EXERCISES 

1. [22] The text illustrates Sobel's original oscillating sort for T = 5 and S = 16.
Give a precise specification of an algorithm that generalizes the procedure, sorting
S = P^L initial runs on T = P + 1 ≥ 3 tapes. Strive for simplicity.

2. [24] If S = 6 in Sobel's original method, we could pretend that S = 16 and that
10 dummy runs were present. Then Phase 3 in the text's example would put dummy
runs A0 on T4 and T5; Phase 4 would merge the A1's on T2 and T3 into a D2 on T1;
Phases 5–8 would do nothing; and Phase 9 would produce A6 on T4. It would be better
to rewind T2 and T3 just after Phase 3, then to produce A6 immediately on T4 by
three-way merging.




Fig. 78. Efficiency of oscillating sort, using the technique of Algorithm B and exercise 3.
[Graph: merges per record versus the number of initial runs S, from 1 to 5000 on a
logarithmic scale, with one curve for each of T = 3, 4, 5, 6, 8, 10.]


Show how to modify the algorithm of exercise 1, so that an improved ending like
this is obtained when S is not a perfect power of P.

► 3. [29] Prepare a chart showing the behavior of Algorithm B when T = 3, assuming 
that there are nine initial runs. Show that the procedure is obviously inefficient in one 
place, and prescribe corrections to Algorithm B that will remedy the situation. 

4. [21] Step B3 sets A[l,q] and A[l,r] to negative values. Show that one of these
two operations is always superfluous, since the corresponding A table entry is never
looked at.

5. [M25] Let S be the number of initial runs present in the input to Algorithm B. 
Which values of S require no rewinding in step B2? 

*5.4.6. Practical Considerations for Tape Merging 

Now comes the nitty-gritty: We have discussed the various families of merge 
patterns, so it is time to see how they actually apply to real configurations of 
computers and magnetic tapes, and to compare them in a meaningful way. Our 
study of internal sorting showed that we can’t adequately judge the efficiency of a 
sorting method merely by counting the number of comparisons it performs; sim- 
ilarly we can’t properly evaluate an external sorting method by simply knowing 
the number of passes it makes over the data. 





In this section we shall discuss the characteristics of typical tape units, and 
the way they affect initial distribution and merging. In particular we shall study 
some schemes for buffer allocation, and the corresponding effects on running 
time. We also shall consider briefly the construction of sort generator programs. 

How tape works. Different manufacturers have provided tape units with widely 
varying characteristics. For convenience, we shall define a hypothetical MIXT tape 
unit, which is reasonably typical of the equipment that was being manufactured 
at the time this book was first written. MIXT reads and writes 800 characters per
inch of tape, at a rate of 75 inches per second. This means that one character
is read or written every 1/60 ms, or 16⅔ microseconds, when the tape is active.
Actual tape units that were available in 1970 had densities ranging from 200 to
1600 characters per inch, and tape speeds ranging from 37½ to 150 inches per
second, so their effective speed varied from 1/8 to 4 times as fast as MIXT.

Of course, we observed near the beginning of Section 5.4 that magnetic tapes 
in general are now pretty much obsolete. But many lessons were learned during 
the decades when tape sorting was of major importance, and those lessons are 
still valuable. Thus our main concern here is not to obtain particular answers; it 
is to learn how to combine theory and practice in a reasonable way. Methodology 
is much more important than phenomenology, because the principles of problem 
solving remain useful despite technological changes. Readers will benefit most 
from this material by transplanting themselves temporarily into the mindset of 
the 1970s. Let us therefore pretend that we still live in that bygone era. 

One of the important considerations to keep in mind, as we adopt the 
perspective of the early days, is the fact that individual tapes have a strictly 
limited capacity. Each reel contains 2400 feet of tape or less; hence there is
room for at most 23,000,000 or so characters per reel of MIXT tape, and it takes
about 23000000/3600000 ≈ 6.4 minutes to read them all. If larger files must be
sorted, it is generally best to sort one reelful at a time, and then to merge the
individually sorted reels, in order to avoid excessive tape handling. This means
that the number of initial runs, S, actually present in the merge patterns we have
been studying is never extremely large. We will never find S > 5000, even with a
very small internal memory that produces initial runs only 5000 characters long.
Consequently the formulas that give asymptotic efficiency of the algorithms as
S → ∞ are primarily of academic interest.
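The bound on S is easy to verify from the reel capacity just quoted; a one-line check of ours:

    # Even 5000-character initial runs cannot fill one 23,000,000-character
    # reel with as many as 5000 runs.
    print(23_000_000 // 5000)     # 4600 initial runs at most per reel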

Data appears on tape in blocks (Fig. 79), and each read/write instruction 
transmits a single block. Tape blocks are often called “records,” but we shall 
avoid that terminology because it conflicts with the fact that we are sorting a 
file of “records” in another sense. Such a distinction was unnecessary on many 
of the early sorting programs written during the 1950s, since one record was 
written per block; but we shall see that it is usually advantageous to have quite 
a few records in every block on the tape. 

An interblock gap, 480 character positions long, appears between adjacent
blocks, in order to allow the tape to stop and to start between individual read
or write commands. The effect of interblock gaps is to decrease the number of





characters per reel of tape, depending on the number of characters per block (see 
Fig. 80); and the average number of characters transmitted per second decreases 
in the same way, since tape moves at a fairly constant speed. 



Fig. 80. The number of characters per reel of MIXT tape, as a function of the block size.
[Graph: characters per reel versus characters per block, 0 to 5000.]

Many old-fashioned computers had fixed block sizes that were rather small; 
their design was reflected in the MIX computer as defined in Chapter 1, which 
always reads and writes 100-word blocks. But MIX's convention corresponds to
about 500 characters per block, and 480 characters per gap, hence almost half 
the tape is wasted! Most machines of the 1970s therefore allowed the block size 
to be variable; we shall discuss the choice of appropriate block sizes below. 
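The trade-off shown in Fig. 80 follows directly from numbers already quoted: a 2400-foot reel at 800 characters per inch holds 2400 × 12 × 800 = 23,040,000 character positions, and every B-character block also consumes a 480-character gap. A quick sketch of ours:

    def chars_per_reel(B, gap=480, capacity=2400 * 12 * 800):
        # Each B-character block occupies B + gap character positions.
        return capacity * B // (B + gap)

    for B in (100, 500, 1000, 5000):
        print(B, chars_per_reel(B))
    # B = 500 gives about 11.8 million characters: nearly half the reel is
    # gaps, as observed above for MIX-style 500-character blocks.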

At the end of a read or write operation, the tape unit “coasts” at full speed 
over the first 66 characters (or so) of the gap. If the next operation for the same 
tape is initiated during this time, the tape motion continues without interruption. 
But if the next operation doesn’t come soon enough, the tape will stop and it 
will also require some time to accelerate to full speed on the next operation. The 
combined stop/start time delay is 5 ms, 2 for the stop and 3 for the start (see 
Fig. 81). Thus if we just miss the chance to have continuous full-speed reading, 
the effect on running time is essentially the same as if there were 780 characters 
instead of 480 in the interblock gap. 

Now let us consider the operation of rewinding. Unfortunately, the exact 
time needed to rewind over a given number n of characters is not easy to 
characterize. On some machines there is a high-speed rewind that applies only 
when n is greater than 5 million or so; for smaller values of n, rewinding goes at 





Fig. 81. How to compute the stop/start delay time. (This gets added to the time used
for reading or writing the blocks and the gaps.)
[Graph: delay versus the time from completion of the previous operation to initiation
of the next command to the tape controller, 0 to 8 ms; continuous read/write is
possible if the command is initiated soon enough, on the same tape, and otherwise the
minimum stop/start delay for noncontinuous read/write applies.]


normal read/write speed. On other machines a special motor is used to control 
all of the rewind operations; it gradually accelerates the tape reel to a certain 
number of revolutions per minute, then puts on the brakes when it is time to 
stop, and the actual tape speed varies with the fullness of the reel. For simplicity, 
we shall assume that MIXT requires max(30, n/150) ms to rewind over n character 
positions (including gaps), roughly two-fifths as long as it took to write them. 
This is a reasonably good approximation to the behavior of many actual tape 
units, where the ratio of read/write time to rewind time is generally between 2 
and 3, but it does not adequately model the effect of combined low-speed and 
high-speed rewind that is present on many other machines. (See Fig. 82.) 

Initial loading and/or rewinding will position a tape at “load point,” and an 
extra 110 ms are necessary for any read or write operation initiated at load point. 
When the tape is not at load point, it may be read backwards; an extra 32 ms is 
added to the time of any backward operation following a forward operation or 
any forward operation following a backward one. 
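For later estimates it is convenient to collect these MIXT timing rules in one place. The following throwaway model is our own; units are milliseconds, and every constant is one quoted above:

    def read_write_ms(chars, blocks, missed_gaps=0):
        """Time to read or write `chars` characters in `blocks` blocks;
        `missed_gaps` counts gaps where the tape had to stop and restart."""
        positions = chars + 480 * blocks          # each block drags a 480-char gap
        return positions / 60 + 5 * missed_gaps   # 60 chars/ms; 5 ms per stop/start

    def rewind_ms(n):
        """Rewind over n character positions: the max(30, n/150) rule."""
        return max(30, n / 150)

    LOAD_POINT_MS, REVERSAL_MS = 110, 32          # the two extra penalties quoted

    # A full reel in 5000-character blocks, read without stopping, then rewound:
    blocks = 23_000_000 // 5000
    print(read_write_ms(23_000_000, blocks) / 60000)            # about 7 minutes
    print(rewind_ms(23_000_000 + 480 * blocks) / 60000)         # about 2.8 minutes

Note that the rewind comes out to two-fifths of the read time, as stated.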



Fig. 82. Approximate running time for two commonly used rewind techniques.
[Graph: rewind time versus number of characters from load point, 0 to 23,000,000.]










Merging revisited. Let us now look again at the process of P-way merging,
with an emphasis on input and output activities, assuming that P + 1 tape units
are being used for the input files and the output file. Our goal is to overlap
the input/output operations as much as possible with each other and with the
computations of the program, so that the overall merging time is minimized.

It is instructive to consider the following special case, in which serious 
restrictions are placed on the amount of simultaneity possible. Suppose that 

a) at most one tape may be written on at any one time; 

b) at most one tape may be read from at any one time; 

c) reading, writing, and computing may take place simultaneously only when 
the read and write operations have been initiated simultaneously. 

It turns out that a system of 2P input buffers and 2 output buffers is sufficient
to keep the tape moving at essentially its maximum speed, even though these
three restrictions are imposed, unless the computer is unusually slow. Note that
condition (a) is not really a restriction, since there is only one output tape.
Furthermore the amount of input is equal to the amount of output, so there is
only one tape being read, on the average, at any given time; if condition (b) is
not satisfied, there will necessarily be periods when no input at all is occurring.
Thus we can minimize the merging time if we keep the output tape busy.

An important technique called forecasting leads to the desired effect. While 
we are doing a P-way merge, we generally have P current input buffers, which
are being used as the source of data; some of them are more full than others, 
depending on how much of their data has already been scanned. If all of them 
become empty at about the same time, we will need to do a lot of reading before 
we can proceed further, unless we have foreseen this eventuality in advance. 
Fortunately it is always possible to tell which buffer will empty first, by simply 
looking at the last record in each buffer. The buffer whose last record has the 
smallest key will always be the first one empty, regardless of the values of any 
other keys; so we always know which file should be the source of our next input 
command. The following algorithm spells out this principle in detail. 

Algorithm F (Forecasting with floating buffers). This algorithm controls the
buffering during a P-way merge of long input files, for P ≥ 2. Assume that the
input tapes and files are numbered 1, 2, …, P. The algorithm uses 2P input
buffers I[1], …, I[2P]; two output buffers O[0] and O[1]; and the following
auxiliary tables:

A[j], 1 ≤ j ≤ 2P:  0 if I[j] is available for input, 1 otherwise.
B[i], 1 ≤ i ≤ P:   Index of the buffer holding the last block read so far from file i.
C[i], 1 ≤ i ≤ P:   Index of the buffer currently being used for the input from file i.
L[i], 1 ≤ i ≤ P:   The last key read so far from file i.
S[j], 1 ≤ j ≤ 2P:  Index of the buffer to use when I[j] becomes empty.

The algorithm described here does not terminate; an appropriate way to shut it
off is discussed below.





Fig. 83. Forecasting with floating buffers. 


F1. [Initialize.] Read the first block from tape i into buffer I[i]; set A[i] ← 1,
A[P+i] ← 0, B[i] ← i, C[i] ← i; and set L[i] to the key of the final
record in buffer I[i], for 1 ≤ i ≤ P. Then find m such that L[m] =
min{L[1], …, L[P]}; and set t ← 0, k ← P + 1. Begin to read from tape
m into buffer I[k].

F2. [Merge.] Merge records from buffers I[C[1]], …, I[C[P]] to O[t], until
O[t] is full. If during this process an input buffer, say I[C[i]], becomes
empty and O[t] is not yet full, set A[C[i]] ← 0, C[i] ← S[C[i]], and
continue to merge.

F3. [I/O complete.] Wait until the previous read (or read/write) operation is
complete. Then set A[k] ← 1, S[B[m]] ← k, B[m] ← k, and set L[m] to
the key of the final record in I[k].

F4. [Forecast.] Find m such that L[m] = min{L[1], …, L[P]}, and find k such
that A[k] = 0.

F5. [Read/write.] Begin to read from tape m into buffer I[k], and to write from
buffer O[t] onto the output tape. Then set t ← 1 − t and return to F2. |
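The essence of the algorithm can be simulated compactly if we ignore real timing and model only the forecasting rule. The sketch below is a condensed Python rendering of our own, not Algorithm F itself: it assumes distinct keys (see exercise 5) and omits the double output buffers, but it "reads" one block ahead from exactly the file that Algorithm F would choose.

    from collections import deque

    def forecast_merge(files):
        """Merge P files of blocks (sorted lists of keys, sorted end to end),
        always pre-reading from the file whose last-read key is smallest."""
        P = len(files)
        on_tape = [deque(f) for f in files]              # blocks not yet read
        in_core = [deque(on_tape[i].popleft()) for i in range(P)]
        last = [in_core[i][-1] for i in range(P)]        # the L[i] table
        out = []
        while True:
            # Forecast (steps F4/F5): the file whose last key is smallest must
            # exhaust its in-core data first, so start reading its next block.
            m = min(range(P), key=lambda i: last[i])
            incoming = on_tape[m].popleft() if on_tape[m] else None
            # Merge (step F2) until file m runs dry; the forecast guarantees
            # every other unfinished file still has data in core meanwhile.
            while in_core[m]:
                j = min((i for i in range(P) if in_core[i]),
                        key=lambda i: in_core[i][0])
                out.append(in_core[j].popleft())
            if incoming is not None:
                in_core[m].extend(incoming)              # the read arrived in time
                last[m] = incoming[-1]
            else:
                last[m] = float('inf')                   # the shut-off trick below
                if all(not q for q in in_core):
                    return out

    # The opening blocks of Fig. 84's two files:
    print(forecast_merge([[[1, 3], [4, 9], [11, 13], [16, 18]],
                          [[2, 5], [6, 7], [8, 10]]]))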

The example in Fig. 84 shows how forecasting works when P = 2, assuming
that each block on tape contains only two records. The input buffer contents are
illustrated each time we get to the beginning of step F2. Algorithm F essentially
forms P queues of buffers, with C[i] pointing to the front and B[i] to the rear
of the ith queue, and with S[j] pointing to the successor of buffer I[j]; these
pointers are shown as arrows in Fig. 84. Line 1 illustrates the state of affairs
after initialization: There is one buffer for each input file, and another block is
being read from File 1 (since 03 < 05). Line 2 shows the status of things after the
first block has been merged: We are outputting a block containing | 01 02 |, and
inputting the next block from File 2 (since 05 < 09). Note that in line 3, three
of the four input buffers are essentially committed to File 2, since we are reading
from that file and we already have a full buffer and a partly full buffer in its
queue.









File 1 contains | 01 03 | 04 09 | 11 13 | 16 18 | ⋯
File 2 contains | 02 05 | 06 07 | 08 10 | ⋯

[Diagram: lines 1–7 show the input-buffer queues each time step F2 begins; the next
input block is being read from File 1, File 2, File 2, File 1, File 2, File 1, and
File 2, respectively.]

Fig. 84. Buffer queuing, according to Algorithm F.


This floating-buffer arrangement is an important feature of Algorithm F,
since we would be unable to proceed in line 4 if we had chosen File 1 instead of
File 2 for the input on line 3.

In order to prove that Algorithm F is valid, we must show two things:

i) There is always an input buffer available (that is, we can always find a k in
step F4).

ii) If an input buffer is exhausted while merging, its successor is already present
in memory (that is, S[C[i]] is meaningful in step F2).

Suppose (i) is false, so that all buffers are unavailable at some point when we
reach step F4. Each time we get to that step, the total amount of unprocessed
data among all the buffers is exactly P bufferloads, just enough data to fill
P buffers if it were redistributed, since we are inputting and outputting data
at the same rate. Some of the buffers are only partially full; but at most one
buffer for each file is partially full, so at most P buffers are in that condition. By
hypothesis all 2P of the buffers are unavailable; therefore at least P of them must
be completely full. This can happen only if P are full and P are empty, otherwise
we would have too much data. But at most one buffer can be unavailable and
empty at any one time; hence (i) cannot be false.

Suppose (ii) is false, so that we have no unprocessed records in memory,
for some file, but the current output buffer is not yet full. By the principle of
forecasting, we must have no more than one block of data for each of the other
files, since we do not read in a block for a file unless that block will be needed
before the buffers on any other file are exhausted. Therefore the total number of
unprocessed records amounts to at most P − 1 blocks; adding the unfilled output
buffer leads to less than P bufferloads of data in memory, a contradiction.








This argument establishes the validity of Algorithm F; and it also indicates
the possibility of pathological circumstances under which the algorithm just
barely avoids disaster. An important subtlety that we have not mentioned,
regarding the possibility of equal keys, is discussed in exercise 5. See also
exercise 4, which considers the case P = 1.

One way to terminate Algorithm F gracefully is to set L[m] to ∞ in step F3
if the block just read is the last of a run. (It is customary to indicate the end of
a run in some special way.) After all of the data on all of the files has been read,
we will eventually find all of the L's equal to ∞ in step F4; then it is usually
possible to begin reading the first blocks of the next run on each file, beginning
initialization of the next merge phase as the final P + 1 blocks are output.

Thus we can keep the output tape going at essentially full speed, without
reading more than one tape at a time. An exception to this rule occurs in step F1,
where it would be beneficial to read several tapes at once in order to get things
going in the beginning; but step F1 can usually be arranged to overlap with the
preceding part of the computation.

The idea of looking at the last record in each block, to predict which buffer
will empty first, was discovered in 1953 by F. E. Holberton. The technique was
first published by E. H. Friend [JACM 3 (1956), 144–145, 165]. His rather
complicated algorithm used 3P input buffers, with three dedicated to each
input file; Algorithm F improves the situation by making use of floating buffers,
allowing any single file to claim as many as P + 1 input buffers at once, yet
never needing more than 2P in all. A discussion of merging with fewer than 2P
input buffers appears at the end of this section. Some interesting improvements
to Algorithm F are discussed in Section 5.4.9.

Comparative behavior of merge patterns. Let us now use what we know
about tapes and merging to compare the effectiveness of the various merge
patterns that we have studied in Sections 5.4.2 through 5.4.5. It is very
instructive to work out the details when each method is applied to the same task.
Consider therefore the problem of sorting a file whose records each contain 100
characters, when there are 100,000 character positions of memory available for
data storage, not counting the space needed for the program and its auxiliary
variables, or the space occupied by links in a selection tree. (Remember that
we are pretending to live in the days when memories were small.) The input
appears in random order on tape, in blocks of 5000 characters each, and the
output is to appear in the same format. There are five scratch tapes to work
with, in addition to the unit containing the input tape.

The total number of records to be sorted is 100,000, but this information is
not known in advance to the sorting algorithm.

The foldout illustration in Chart A summarizes the actions that transpire 
when ten different merging schemes are applied to this data. The best way to look 
at this important illustration is to imagine that you are actually watching the 
sort take place: Scan each line slowly from left to right, pretending that you can 
actually see six tapes reading, writing, rewinding, and/or reading backwards, as 




indicated on the diagram. During a P-way merge the input tapes will be moving
only 1/P times as often as the output tape. When the original input tape has
been completely read (and rewound "with lock"), Chart A assumes that a skilled
computer operator dismounts it and replaces it with a scratch tape, in just 30
seconds. In examples 2, 3, and 4 this is "critical path time" when the computer
is idly waiting for the operator to finish; but in the remaining examples, the
dismount-reload operation is overlapped by other processing.

Example 1. Read- forward balanced merge. Let’s review the specifications 
of the problem: The records are 100 characters long, there is enough internal 
memory to hold 1000 records at a time, and each block on the input tape contains 
5000 characters (50 records). There are 100,000 records (= 10,000,000 characters 
= 2000 blocks) in all. 

We are free to choose the block size for intermediate files. A six-tape 
balanced merge uses three-way merging, so the technique of Algorithm F calls for 
8 buffers; we may therefore use blocks containing 1000/8 = 125 records (= 12500 
characters) each. 

The initial distribution pass can make use of replacement selection (Algo- 
rithm 5.4. 1R), and in order to keep the tapes running smoothly we may use two 
input buffers of 50 records each, plus two output buffers of 125 records each. 
This leaves room for 650 records in the replacement selection tree. Most of the 
initial runs will therefore be about 1300 records long (10 or 11 blocks); it turns 
out that 78 initial runs are produced in Chart A, the last one being rather short. 

The first merge pass indicated shows nine runs merged to tape 4, instead of
alternating between tapes 4, 5, and 6. This makes it possible to do useful work
while the computer operator is loading a scratch tape onto unit 6; since the total
number S of runs is known once the initial distribution has been completed, the
algorithm knows that ⌈S/9⌉ runs should be merged to tape 4, then ⌈(S − 3)/9⌉
to tape 5, then ⌈(S − 6)/9⌉ to tape 6.

The entire sorting procedure for this example can be summarized in the 
following way, using the notation introduced in Section 5.4.2: 


1^26    1^26    1^26       —       —       —
 —       —       —        3^9     3^9     3^8
9^3     9^3     9^2 6^1    —       —       —
 —       —       —       27^1    27^1    24^1
78^1     —       —         —       —       —


Example 2. Read-forward polyphase merge. The second example in 
Chart A carries out the polyphase merge, according to Algorithm 5.4.2D. In
this case we do five-way merging, so the memory is split into 12 buffers of 83
records each. During the initial replacement selection we have two 50-record
input buffers and two 83-record output buffers, leaving 734 records in the tree;
so the initial runs this time are about 1468 records long (17 or 18 blocks). The
situation illustrated shows that S = 70 initial runs were obtained, the last two




actually being only four blocks and one block long, respectively. The merge 
pattern can be summarized thus: 


0^13 1^18   0^13 1^17   0^13 1^15   0^12 1^12   0^8 1^8    —
1^15        1^14        1^12        1^8          —         0^8 1^4 2^1 5^3
1^7         1^6         1^4          —          4^8        1^4 2^1 5^3
1^3         1^2          —          8^4         4^4        2^1 5^3
1^1          —          16^1 19^1   8^2         4^2        5^2
 —          34^1        19^1        8^1         4^1        5^1
70^1         —           —           —           —          —


Curiously, polyphase actually took about 25 seconds longer than the far less 
sophisticated balanced merge! There are two main reasons for this: 

1) Balanced merge was particularly lucky in this case, since S = 78 is just 
less than a perfect power of 3. If 82 initial runs had been produced, the balanced 
merge would have needed an extra pass. 

2) Polyphase merge wasted 30 seconds while the input tape was being 
changed, and a total of more than 5 minutes went by while it was waiting for 
rewind operations to be completed. By contrast the balanced merge needed 
comparatively little rewind time. In the second phase of the polyphase merge, 
13 seconds were saved because the 8 dummy runs on tape 6 could be assumed 
present even while that tape was rewinding; but no other rewind overlap oc- 
curred. Therefore polyphase lost out even though it required significantly less 
read/write time. 


Example 3. Read-forward cascade merge. This case is analogous to the
preceding, but using Algorithm 5.4.3C. The merging may be summarized thus:

(initial distribution of the 70 runs)
  ⋮       ⋮       ⋮       ⋮       ⋮       ⋮
18^1    18^1    16^1    12^1     6^1      —
 —       —       —       —       —      70^1

(Remember to watch each of these examples in action, by scanning Chart A in 
the foldout illustration.) 


Example 4. Tape-splitting polyphase merge. This procedure, described at 
the end of Section 5.4.2, allows most of the rewind time to be overlapped. It uses 
four-way merging, so we divide the memory into ten 100-record buffers; there are 
700 records in the replacement selection tree, so it turns out that 72 initial runs 
are formed. The last run, again, is very short. A distribution scheme analogous
to Algorithm 5.4.2D has been used, followed by a simple but somewhat ad hoc




method of placing dummy runs: 


1^21    1^19    1^15    1^8      —      0^2 1^9
  ⋮       ⋮       ⋮      ⋮       ⋮        ⋮
18^1    14^1    13^1     —       —      27^1
 —       —       —      72^1     —        —


This turns out to give the best running time of all the examples in Chart A that 
do not read backwards. Since S will never be very large, it would be possible to 
develop a more complicated algorithm that places dummy runs in an even better 
way; see Eq. 5.4.2-(26). 

Example 5. Cascade merge with rewind overlap. This procedure runs 
almost as fast as the previous example, although the algorithm governing it is 
much simpler. We simply use the cascade sort method as in Algorithm 5.4.3C
for the initial distribution, but with T = 5 instead of T = 6. Then each phase
of each “cascade” staggers the tapes so that we ordinarily don’t write on a tape 
until after it has had a chance to be rewound. The pattern, very briefly, is 


1^22    1^21    1^19    1^10     —
  ⋮       ⋮       ⋮       ⋮      ⋮
26^1     8^1    22^1    16^1     —
 —       —       —       —     72^1

Example 6. Read-backward balanced merge. This is like example 1 but 
with all the rewinding eliminated: 


A1^26    A1^26    A1^26        —        —        —
  —        —        —        D3^9     D3^9     D3^8
A9^3     A9^3     A9^2 A6^1    —        —        —
  —        —        —        D27^1    D27^1    D24^1
A78^1      —        —          —        —        —


328 SORTING 


5.4.6 


Since there was comparatively little rewinding in example 1, this scheme is not a
great deal better than the read-forward case. In fact, it turns out to be slightly
slower than tape-splitting polyphase, in spite of the fortunate value S = 78.

Example 7. Read-backward polyphase merge. In this example only five of
the six tapes are used, in order to eliminate the time for rewinding and changing
the input tape. Thus, the merging is only four-way, and the buffer allocation
is like that in examples 4 and 5. A distribution like Algorithm 5.4.2D is used,
but with alternating directions of runs, and with tape 1 fixed as the final output
tape. First an ascending run is written on tape 1; then descending runs on tapes
2, 3, 4; then ascending runs on 2, 3, 4; then descending on 1, 2, 3; etc. Each time
we switch direction, replacement selection usually produces a shorter run, so it
turns out that 77 initial runs are formed instead of the 72 in examples 4 and 5.

This procedure results in a distribution of (22, 21, 19, 15) runs, and the next
perfect distribution is (59, 56, 52, 44). Exercise 5.4.4–5 shows how to generate
strings of merge numbers that can be used to place dummy runs in optimum
positions; such a procedure is feasible in practice because the finiteness of a tape
reel ensures that S is never too large. Therefore the example in Chart A has
been constructed using such a method for dummy run placement (see exercise 7).
This turns out to be the fastest of all the examples illustrated.

Example 8. Read-backward cascade merge. As in example 7, only five
tapes are used here. This procedure follows Algorithm 5.4.3C, using rewind and
forward read to avoid one-way merging (since rewinding is more than twice as
fast as reading on MIXT units). Distribution is therefore the same as in example 5.
The pattern may be summarized briefly as follows, using | to denote rewinding:


A1^22    A1^21    A1^19    A1^10     —
  ⋮        ⋮        ⋮        ⋮       ⋮
A72^1      —        —        —        —

Example 9. Read-backward oscillating sort. Oscillating sort with T = 5
(Algorithm 5.4.5B) can use buffer allocation as in examples 4, 5, 7, and 8, since
it does four-way merging. However, replacement selection does not behave in
the same way, since a run of length 700 (not 1400 or so) is output just before
entering each merge phase, in order to clear the internal memory. Consequently
85 runs are produced in this example, instead of 72. Some of the key steps in
the process are


D4 D4    D4 D4    D4 D4    D4      —
  ⋮        ⋮        ⋮       ⋮      ⋮
A85       —        —        —       —

Example 10. Read-forward oscillating sort. In the final example, replacement
selection is not used because all initial runs must be the same length.
Therefore full core loads of 1000 records are sorted internally whenever an initial
run is required; this makes S = 100. Some key steps in the process are

A1           A1        A1        A1        A1
A1 A4        A1 A4     A1 A4     A1 A4     A1 A4
A1 A4 A16    A1 A4     A1 A4     A1 A4     A1 A4
  ⋮            ⋮         ⋮         ⋮        ⋮
A100          —         —         —         —


This routine turns out to be slowest of all, partly because it does not use 
replacement selection, but mostly because of its rather awkward ending (a two- 
way merge). 





Fig. 85. A somewhat misleading way to compare merge patterns. 


Estimating the running time. Let's see now how to figure out the approximate
execution time of a sorting method using MIXT tapes. Could we
have predicted the outcomes shown in Chart A without carrying out a detailed
simulation?

One way that has traditionally been used to compare different merge patterns
is to superimpose graphs such as we have seen in Figs. 70, 74, and 78.
These graphs show the effective number of passes over the data, as a function of
the number of initial runs, assuming that each initial run has approximately the
same length. (See Fig. 85.) But this is not a very realistic comparison, because
we have seen that different methods lead to different numbers of initial runs;
furthermore there is a different overhead time caused by the relative frequency
of interblock gaps, and the rewind time also has significant effects. All of these
machine-dependent features make it impossible to prepare charts that provide
a valid machine-independent comparison of the methods. On the other hand,
Fig. 85 does show us that, except for balanced merge, the effective number
of passes can be reasonably well approximated by smooth curves of the form
α ln S + β. Therefore we can make a fairly good comparison of the methods
in any particular situation, by studying formulas that approximate the running
time. Our goal, of course, is to find formulas that are simple yet sufficiently
realistic.

Let us now attempt to develop such formulas, in terms of the following
parameters:

N   = number of records to be sorted,
C   = number of characters per record,
M   = number of character positions available in the internal memory (assumed to
      be a multiple of C),
τ   = number of seconds to read or write one character,
ρτ  = number of seconds to rewind over one character,
στ  = number of seconds for stop/start time delay,
γ   = number of characters per interblock gap,
δ   = number of seconds for operator to dismount and replace input tape,
B_i = number of characters per block in the unsorted input,
B_o = number of characters per block in the sorted output.

For MIXT we have τ = 1/60000, ρ = 2/5, σ = 300, γ = 480. The example
application treated above has N = 100000, C = 100, M = 100000, δ = 30, B_i =
B_o = 5000. These parameters are usually the machine and data characteristics
that affect sorting time most critically (although rewind time is often given by a
more complicated expression than a simple ratio ρ). Given the parameters above
and a merge pattern, we shall compute further quantities such as


P  = maximum order of merge in the pattern,
P′ = number of records in the replacement selection tree,
S  = number of initial runs,
π  = α ln S + β   = approximate average number of times each character is read
                    and written, not counting the initial distribution or the final
                    merge,
π′ = α′ ln S + β′ = approximate average number of times rewinding occurs over
                    each character during the intermediate merge phases,
B  = number of characters per block in the intermediate merge phases,
ω_i, ω, ω_o = "overhead ratio," the effective time required to read or write
              a character (due to gaps and stop/start) divided by the hardware
              time τ.


The examples of Chart A have chosen block and buffer sizes according to
the formula

    B = C ⌊M/((2P + 2)C)⌋,                                    (1)

so that the blocks can be as large as possible consistent with the buffering scheme
of Algorithm F. (In order to avoid trouble during the final pass, P should be
small enough that (1) makes B ≥ B_o.) The size of the tree during replacement
selection is then

    P′ = (M − 2B_i − 2B)/C.                                   (2)

For random data the number of initial runs S can be estimated as

    S ≈ ⌈N/(2P′) + 3/2⌉,                                      (3)




using the results of Section 5.4.1. Assuming that B_i ≤ B and that the input
tape can be run at full speed during the distribution (see below), it takes about
NCω_iτ seconds to distribute the initial runs, where

    ω_i = (B_i + γ)/B_i.                                      (4)

While merging, the buffering scheme allows simultaneous reading, writing, and
computing, but the frequent switching between input tapes means that we must
add the stop/start time penalty; therefore we set

    ω = (B + γ + σ)/B,                                        (5)

and the merge time is approximately

    (π + ρπ′)NCωτ.                                            (6)

This formula penalizes rewind slightly, since ω includes stop/start time, but
other considerations, such as rewind interlock and the penalty for reading from
load point, usually compensate for this. The final merge pass, assuming that
B_o ≤ B, is constrained by the overhead ratio

    ω_o = (B_o + γ)/B_o.                                      (7)

We may estimate the running time of the final merge and rewind as NC(1 + ρ)ω_oτ;
in practice it might take somewhat longer due to the presence of unequal block
lengths (input and output are not synchronized as in Algorithm F), but the
running time will be pretty much the same for all merge patterns.

Before going into more specific formulas for individual patterns, let us try
to justify two of the assumptions made above.

a) Can replacement selection keep up with the input tape? In the examples
of Chart A it probably can, since it takes about ten iterations of the inner
loop of Algorithm 5.4.1R to select the next record, and we have Cω_iτ > 1667
microseconds in which to do this. With careful programming of the replacement
selection loop, this can be done on most machines (even in the 1970s). Notice
that the situation is somewhat less critical while merging: The computation time
per record is almost always less than the tape time per record during a P-way
merge, since P isn't very large.

b) Should we really choose B to be the maximum possible buffer size, as
in (1)? A large buffer size cuts down the overhead ratio ω in (5); but it also
increases the number of initial runs S, since P′ is decreased. It is not immediately
clear which factor is more important. Considering the merging time as a function
of x = CP′, we can express it in the approximate form of Eq. (8), for some
appropriate constants θ₁, θ₂, θ₃, θ₄, with θ₃ > θ₄. Differentiating with
respect to x shows that there is some N₀ such that for all N > N₀ it does not pay



to increase x at the expense of buffer size. In the sorting application of Chart A,
for example, N₀ turns out to be roughly 10000; when sorting more than 10000
records the large buffer size is superior.

Note, however, that with balanced merge the number of passes jumps sharply
when S passes a power of P. If an approximation to N is known in advance,
the buffer size should be chosen so that S will most likely be slightly less than a
power of P. For example, the buffer size for the first line of Chart A was 12500;
since S = 78, this was very satisfactory, but if S had turned out to be 82 it
would have been much better to decrease the buffer size a little.

Formulas for the ten examples. Returning to Chart A, let us try to give
formulas that approximate the running time in each of the ten methods. In most
cases the basic formula

    NCω_iτ + (π + ρπ′)NCωτ + (1 + ρ)NCω_oτ                    (9)

will be a sufficiently good approximation to the overall sorting time, once we
have specified the number of intermediate merge passes π = α ln S + β and the
number of intermediate rewind passes π′ = α′ ln S + β′. Sometimes it is necessary
to add a further correction to (9); details for each method can be worked out as
follows:
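The whole recipe is mechanical, so it can be checked by machine. The following sketch of ours — with the MIXT constants, and using the run-count estimate (3) as reconstructed above — reproduces the "(9)" column of Table 1 below to within a few seconds:

    from math import ceil, floor, log

    TAU, RHO, SIGMA, GAMMA = 1 / 60000, 2 / 5, 300, 480    # MIXT constants

    def estimate_seconds(N, C, M, Bi, Bo, P, alpha, beta, alpha2, beta2):
        B = C * floor(M / ((2 * P + 2) * C))        # Eq. (1): merge block size
        Pp = (M - 2 * Bi - 2 * B) // C              # Eq. (2): selection-tree size
        S = ceil(N / (2 * Pp) + 3 / 2)              # Eq. (3): initial runs
        wi = (Bi + GAMMA) / Bi                      # Eq. (4)
        w = (B + GAMMA + SIGMA) / B                 # Eq. (5)
        wo = (Bo + GAMMA) / Bo                      # Eq. (7)
        pi = alpha * log(S) + beta                  # intermediate merge passes
        pi2 = alpha2 * log(S) + beta2               # intermediate rewind passes
        return (N * C * wi * TAU                    # Eq. (9): distribution +
                + (pi + RHO * pi2) * N * C * w * TAU    # merging/rewinding +
                + (1 + RHO) * N * C * wo * TAU)     # final merge and rewind

    # Example 1 (balanced merge: alpha = 1/ln 3, beta = -1, alpha' = 1/(3 ln 3)):
    print(estimate_seconds(100000, 100, 100000, 5000, 5000, 3,
                           1 / log(3), -1.0, 1 / (3 * log(3)), 0.0))

The printed value is about 1060 seconds, close to the 1064 entered in Table 1; the per-example corrections ("Additions to (9)") must still be added by hand.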

Example 1. Read-forward balanced merge. The formulas

    π = ⌈ln S/ln P⌉ − 1,    π′ = ⌈ln S/ln P⌉/P

may be used for P-way merging on 2P tapes.

Example 2. Read-forward polyphase merge. We may take π′ ≈ π, since
every phase is usually followed by a rewind of about the same length as the
previous merge. From Table 5.4.2–1 we get the values α ≈ 0.795, β ≈ 0.864 − 2,
in the case of six tapes. (We subtract 2 because the table entry includes the
initial and final passes as well as the intermediate ones.) The time for rewinding
the input tape after the initial distribution, namely ρNCω_iτ + δ, should be added
to (9).

Example 3. Read-forward cascade merge. Table 5.4.3–1 gives the values
α ≈ 0.773, β ≈ 0.808 − 2. Rewind time is comparatively difficult to estimate;
perhaps setting π′ ≈ π is accurate enough. As in example 2, we need to add the
initial rewind time to (9).

Example 4. Tape-splitting polyphase merge. Table 5.4.2–6 tells us that
α ≈ 0.752, β ≈ 1.024 − 2. The rewind time is almost overlapped except after
the initialization (ρNCω_iτ + δ) and two phases near the end (2ρNCωτ times
36 percent). We may also subtract 0.018 from β, since the first half phase is
overlapped by the initial rewind.

Example 5. Cascade merge with rewind overlap. In this case we use
Table 5.4.3–1 for T = 5, to get α ≈ 0.897, β ≈ 0.800 − 2. Nearly all of the
unoverlapped rewind occurs just after the initial distribution and just after each
two-way merge. After a perfect initial distribution, the longest tape contains
about 1/g of the data, where g is the "growth ratio." The amount of rewind
after the two-way merges in the T-tape case can be shown to be approximately

    (2/(2T − 1))(1 − cos(4π/(2T − 1)))

of the file (see exercise 5.4.3–5, which treats the six-tape case). In our case,
T = 5, this is (2/9)(1 − cos 80°) ≈ 0.184 of the file, and the
number of times it occurs is 0.946 ln S + 0.796 − 2.

Example 6. Read-backward balanced merge. This is like example 1, except
that most of the rewinding is eliminated. The change in direction from
forward to backward causes some delays, but they are not significant. There is
a 50-50 chance that rewinding will be necessary before the final pass, so we may
take π′ = 1/(2P).

Example 7. Read-backward polyphase merge. Since replacement selection
in this case produces runs that change direction about every P times, we
must replace (3) by another formula for S. A reasonably good approximation,
suggested by exercise 5.4.1–24, is S = ⌈N(3 + 1/P)/(6P′)⌉ + 1. All rewind time
is eliminated, and Table 5.4.2–1 gives α ≈ 0.863, β ≈ 0.921 − 2.

Example 8. Read-backward cascade merge. From Table 5.4.3–1 we have
α ≈ 0.897, β ≈ 0.800 − 2. The rewind time can be estimated as twice the
difference between "passes with copying" and "passes without copying" in
that table, plus 1/(2P) in case the final merge must be preceded by rewinding
to get ascending order.

Example 9. Read-backward oscillating sort. In this case replacement selection
has to be started and stopped many times; bursts of P − 1 to 2P − 1
runs are distributed at a time, averaging about P per burst; the average length of
runs therefore turns out to be approximately P′(2P − 4/3)/P, and we may estimate
S = ⌈N/((2 − 4/(3P))P′)⌉ + 1. A little time is used to switch from merging to
distribution and vice versa; this is approximately the time to read in P′ records
from the input tape, namely P′Cω_iτ, and it occurs about S/P times. Rewind
time and merging time may be estimated as in example 6.

Example 10. Read-forward oscillating sort. This method is not easy to
analyze, because the final "cleanup" phases performed after the input is exhausted
are not as efficient as the earlier phases. Ignoring this troublesome
aspect, and simply calling it one extra pass, we can estimate the merging time by
setting α = 1/ln P, β = 0, and π′ = π/P. The distribution of runs is somewhat
different in this case, since replacement selection is not used; we set P′ = M/C
and S = ⌈N/P′⌉. With care we will be able to overlap computing, reading, and
writing during the distribution, with an additional factor of about (M + 2B)/M in
the overhead. The "mode-switching" time mentioned in example 9 is not needed
in the present case because it is overlapped by rewinding. So the estimated
sorting time in this case is (9) plus 2BNCω_iτ/M.




Table 1
SUMMARY OF SORTING TIME ESTIMATES

Ex.  P     B    P′    S    ω      α       β      α′      β′    (9)  Additions to (9)  Est. total  Actual total
 1   3  12500  650   79  1.062  0.910  −1.000  0.303   0.000  1064        —              1064        1076
 2   5   8300  734   70  1.094  0.795  −1.136  0.795  −1.136  1010  ρNCω_iτ + δ          1113        1103
 3   5   8300  734   70  1.094  0.773  −1.192  0.773  −1.192   972  ρNCω_iτ + δ          1075        1127
 4   4  10000  700   73  1.078  0.752  −0.994  0.000   0.720   844  ρNCω_iτ + δ           947         966
 5   4  10000  700   73  1.078  0.897  −1.200  0.173   0.129   972        —               972         992
 6   3  12500  650   79  1.062  0.910  −1.000  0.000   0.167   981        —               981         980
 7   4  10000  700   79  1.078  0.863  −1.079  0.000   0.000   922        —               922         907
 8   4  10000  700   73  1.078  0.897  −1.200  0.098   0.117   952        —               952         949
 9   4  10000  700   87  1.078  0.721  −1.000  0.000   0.125   846  P′SCω_iτ/P            874         928
10   4  10000   —   100  1.078  0.721   0.000  0.180   0.000  1095  2BNCω_iτ/M           1131        1158

(All totals are in seconds.)


Table 1 shows that the estimates are not too bad in these examples, although
in a few cases there is a discrepancy of 50 seconds or so. The formulas in
examples 2 and 3 indicate that cascade merge should be preferable to polyphase
on six tapes, yet in practice polyphase was better. The reason is that graphs
like Fig. 85 (which shows the five-tape case) are more nearly straight lines for
the polyphase algorithm; cascade is superior to polyphase on six tapes for 14 ≤
S ≤ 15 and 43 ≤ S ≤ 55, near the "perfect" cascade numbers 15 and 55, but
the polyphase distribution of Algorithm 5.4.2D is equal or better for all other
S ≤ 100. Cascade will win over polyphase as S → ∞, but S doesn't actually
approach ∞. The underestimate in example 9 is due to similar circumstances;
polyphase was superior to oscillating even though the asymptotic theory tells us
that oscillating will be better for large S.

Some miscellaneous remarks. It is now appropriate to make a few more or 
less random observations about tape merging. 

• The formulas above show that the cost of tape sorting is essentially a 
function of N times C , not of N and C independently. Except for a few relatively 
minor considerations (such as the fact that B was taken to be a multiple of C), 
our formulas say that it takes about as long to sort one million records of 10 
characters each as to sort 100,000 records of 100 characters each. Actually there 
may be a difference, not revealed in our formulas, because of the space used by 
link fields during replacement selection. In any event the size of the key makes 
hardly any difference, unless keys get so long and complicated that internal 
computation cannot keep up with the tapes. 

With long records and short keys it is tempting to “detach” the keys, sort 
them first, and then somehow rearrange the records as a whole. But this idea 
doesn’t really work; it merely postpones the agony, because the final rearrange- 
ment procedure takes about as long as a conventional merge sort would take. 

• When writing a sort routine that is to be used repeatedly, it is wise to 
estimate the running time very carefully and to compare the theory with actual 
observed performance. Since the theory of sorting has been fairly well developed, 
this procedure has been known to turn up bugs in the input/output hardware or
software on existing systems; the service was substantially slower than it should
have been, yet nobody had noticed it until the sorting routine ran too slowly! 

• Our analysis of replacement selection has been carried out for “random” 
files, but the files that actually arise in practice very often have a good deal of 
existing order. (In fact, sometimes people will sort a file that is already in order, 
just to be sure.) Therefore experience has shown that replacement selection 
is preferable to other kinds of internal sort, even more so than our formulas 
indicate. This advantage is slightly mitigated in the case of read-backward 
polyphase sorting, since a number of descending runs must be produced; indeed,
R. L. Gilstad (who first published the polyphase merge) originally rejected the
read-backward technique for that reason. But he noticed later that alternating
directions will still pick up long ascending runs. Furthermore, read-backward
polyphase is the only standard technique that likes descending input files as well 
as ascending ones. 

• Another advantage of replacement selection is that it allows simultaneous 
reading, writing, and computing. If we merely did the internal sort in an obvious 
way — filling the memory, sorting it, then writing it out as it becomes filled with 
the next load — the distribution pass would take about twice as long. 

The only other internal sort we have discussed that appears to be amenable
to simultaneous reading, writing, and computing is heapsort. Suppose for
convenience that the internal memory holds 1000 records, and that each block on
tape holds 100. Example 10 of Chart A was prepared with the following strategy,
letting B1 B2 . . . B10 stand for the contents of memory divided into ten 100-record
blocks:

Step 0. Fill memory, and make the elements of B2 . . . B10 satisfy the inequalities
for a heap (with smallest element at the root).

Step 1. Make B1 . . . B10 into a heap, then select out the least 100 records and
move them to B10.

Step 2. Write out B10, while selecting the smallest 100 records of B1 . . . B9 and
moving them to B9.

Step 3. Read into B10, and write out B9, while selecting the smallest 100 records
of B1 . . . B8 and moving them to B8.

. . .

Step 9. Read into B4, and write out B3, while selecting the smallest 100 records
of B1 B2 and moving them to B2 and while making the heap inequalities valid
in B5 . . . B10.

Step 10. Read into B3, and write out B2, while sorting B1 and while making
the heap inequalities valid in B4 . . . B10.

Step 11. Read into B2, and write out B1, while making the heap inequalities
valid in B3 . . . B10.

Step 12. Read into B1, while making the heap inequalities valid in B2 . . . B10.
Return to step 1. |


5.4.6 


PRACTICAL CONSIDERATIONS FOR TAPE MERGING 337 


• We have been assuming that the number N of records to be sorted is not 
known in advance. Actually in most computer applications it would be possible 
to keep track of the number of records in all files at all times, and we could assume 
that our computer system is capable of telling us the value of N. How much help 
would this be? Unfortunately, not very much! We have seen that replacement 
selection is very advantageous, but it leads to an unpredictable number of initial 
runs. In a balanced merge we could use information about N to set the buffer
size B in such a way that S will probably be just less than a power of P; and in
a polyphase distribution with optimum placement of dummy runs we could use
information about N to decide what level to shoot for (see Table 5.4.2-2).
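The balanced-merge case can be illustrated with a small Python sketch. The
model function Pprime_for(B) and the run-count estimate S ≈ ⌈N/(2P')⌉ + 1 are
assumptions of this illustration, not part of the text:

    import math

    def tune_buffer_size(N, P, B_candidates, Pprime_for):
        # Pick B so that the expected run count S lands just below a power
        # of P, minimizing the number of balanced-merge passes ceil(log_P S);
        # among equally good B's, prefer the largest (less gap overhead).
        best = None
        for B in B_candidates:
            S = math.ceil(N / (2 * Pprime_for(B))) + 1    # expected runs
            passes = math.ceil(math.log(S, P))
            if best is None or (passes, -B) < (best[0], -best[1]):
                best = (passes, B, S)
        return best   # (merge passes, block size, expected runs)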

• Tape drives tend to be the least reliable part of a computer. Therefore the 
original input tape should never be destroyed until it is known that the entire sort 
has been satisfactorily completed. The “operator dismount time” is annoying in 
some of the examples of Chart A, but it would be too risky to overwrite the input 
in view of the probability that something might go wrong during a long sort. 

• When changing from forward write to backward read, we could save some 
time by never writing the last bufferload onto tape; it will just be read back in 
again anyway. But Chart A shows that this trick actually saves comparatively 
little time, except in the oscillating sort where directions are reversed frequently. 

• Although a large computer system might have lots of tape units, we might 
be better off not using them all. The percentage difference between log_P S and
log_{P+1} S is not very great when P is large, and a higher order of merge usually
implies a smaller block size. (Consider also the poor computer operator who 
has to mount all those scratch tapes.) On the other hand, exercise 12 describes 
an interesting way to make use of additional tape units, grouping them so as to 
overlap input/output time without increasing the order of merge. 

• On machines like MIX, which have fixed rather small block sizes, hardly any 
internal memory is needed while merging. Oscillating sort then becomes more 
attractive, because it becomes possible to maintain the replacement selection 
tree in memory while merging. In fact we can improve on oscillating sort in this 
case (as suggested by Colin J. Bell in 1962), merging a new initial run into the 
output every time we merge from the working tapes. 

• We have observed that multireel files should be sorted one reel at a time, 
in order to avoid excessive tape handling. This is sometimes called a “reel time” 
application. Actually a balanced merge on six tapes can sort three reelfuls, up 
until the time of the final merge, if it has been programmed carefully. 

To merge a fairly large number of individually sorted reels, a minimum-
path-length merging tree will be fastest (see Section 5.4.4). This construction
was first made by E. H. Friend [JACM 3 (1956), 166-167]; then W. H. Burge
[Information and Control 1 (1958), 181-197] pointed out that an optimum way
to merge runs of given (possibly unequal) lengths is obtained by constructing a
tree with minimum weighted path length, using the run lengths as weights (see
Sections 2.3.4.5 and 5.4.9), if we ignore tape handling time.
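For two-way merging, the minimum-weighted-path-length order is exactly
Huffman's construction; here is a minimal Python sketch (illustrative only),
counting each record moved as one unit of cost:

    import heapq

    def optimal_merge_cost(run_lengths):
        # Build the minimum-weighted-path-length tree bottom-up: always
        # merge the two shortest remaining runs (Huffman's construction).
        heap = list(run_lengths)
        heapq.heapify(heap)
        cost = 0
        while len(heap) > 1:
            a = heapq.heappop(heap)
            b = heapq.heappop(heap)
            cost += a + b          # every record in a and b is copied once
            heapq.heappush(heap, a + b)
        return cost

    # optimal_merge_cost([1, 1, 3, 4, 5]) == 30, via the merges
    # 1+1 = 2, 2+3 = 5, 4+5 = 9, 5+9 = 14, costing 2+5+9+14.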


338 SORTING 


5.4.6 


• Our discussions have blithely assumed that we have direct control over 
the input/output instructions for tape units, and that no complicated operating 
system keeps us from using tape as efficiently as the tape designers intended. 
These idealistic assumptions give us insights into the tape merging problem, and 
may give some insights into the proper design of operating system interfaces, 
but we should realize that multiprogramming and multiprocessing can make the 
situation considerably more complicated. 

• The issues we have studied in this section were first discussed in print 
by E. H. Friend [JACM 3 (1956), 134 168], W. Zoberbier [Elektronische Daten- 
verarbeitung 5 (1960), 28-44], and M. A. Goetz [ Digital Computer User's Hand- 
book (New York: McGraw Hill, 1967), 1.292-1.320]. 

Summary. We can sum up what we have learned about the relative efficiencies 
of different approaches to tape sorting in the following way: 

Theorem A. It is difficult to decide which merge pattern is best in a given 
situation. | 

The examples we have seen in Chart A show how 100,000 randomly ordered 
100-character records (or 1 million 10-character records) might be sorted using 
six tapes under realistic assumptions. This much data fills about half of a tape, 
and it can be sorted in about 15 to 19 minutes on the MIXT tapes. However, there 
is considerable variation in available tape equipment, and running times for such 
a job could vary between about four minutes and about two hours on different 
machines of the 1970s. In our examples, about 3 minutes of the total time were
used for initial distribution of runs and internal sorting; about 4½ minutes were
used for the final merge and rewinding the output tape; and about 7½ to 11½
minutes were spent in intermediate stages of merging.

Given six tapes that cannot read backwards, the best sorting method under 
our assumptions was the “tape-splitting polyphase merge” (example 4); and for 
tapes that do allow backward reading, the best method turned out to be read- 
backward polyphase with a complicated placement of dummy runs (example 7). 
Oscillating sort (example 9) was a close second. In both cases the cascade merge 
provided a simpler alternative that was only slightly slower (examples 5 and 8). 
In the read-forward case, a straightforward balanced merge (example 1) was 
surprisingly effective, partly by luck in this particular example but partly also 
because it spends comparatively little time rewinding. 

The situation would change somewhat if we had a different number of 
available tapes. 

Sort generators. Given the wide variability of data and equipment charac- 
teristics, it is almost impossible to write a single external sorting program that is 
satisfactory in a variety of different applications. And it is also rather difficult to 
prepare a program that really handles tapes efficiently. Therefore the preparation 
of sorting software is a particularly challenging job. A sort generator is a program 
that produces machine code specially tailored to particular sorting applications, 



[Chart A (tape merging) occupies these pages: a bar chart, with time in minutes,
showing the phases of examples 1-10. Legend: reading in forward direction;
reading in backward direction; writing in forward direction; rewinding; operator
changing tapes.]

Chart A. Tape merging.




based on parameters that describe the data format and the hardware configura- 
tion. Such a program is often tied to high-level languages such as COBOL or PL/I. 

One of the features normally provided by a sort generator is the ability 
to insert the user’s “own coding,” a sequence of special instructions to be in- 
corporated into the first and last passes of the sorting routine. First-pass own 
coding is usually used to edit the input records, often shrinking them or slightly 
expanding them into a form that is easier to sort. For example, suppose that
the input records are to be sorted on a nine-character key that represents a date
in month-day-year format:

JUL041776 OCT311517 NOV051605 JUL141789 NOV071917

On the first pass the three-letter month code can be looked up in a table, and 
the month codes can be replaced by numbers with the most significant fields at 
the left: 


17760704 15171031 16051105 17890714 19171107 

This decreases the record length and makes subsequent comparisons much sim- 
pler. (An even more compact code could also be substituted.) Last-pass own 
coding can be used to restore the original format, and/or to make other desired 
changes to the file, and/or to compute some function of the output records. The 
merging algorithms we have studied are organized in such a way that it is easy 
to distinguish the last pass from other merges. Notice that when own coding 
is present there must be at least two passes over the file even if it is initially 
in order. Own coding that changes the record size can make it difficult for the 
oscillating sort to overlap some of its input/output operations. 
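The month-code transformation just described might be coded as follows (an
illustrative Python sketch; the names MONTHS, edit_key, and restore_key are
invented here, and the record layout is the nine-character key shown above):

    MONTHS = {'JAN': 1, 'FEB': 2, 'MAR': 3, 'APR': 4, 'MAY': 5, 'JUN': 6,
              'JUL': 7, 'AUG': 8, 'SEP': 9, 'OCT': 10, 'NOV': 11, 'DEC': 12}

    def edit_key(key):
        # First-pass "own coding": MMMDDYYYY -> YYYYMMDD, so that a plain
        # character comparison orders the dates chronologically.
        month, day, year = key[:3], key[3:5], key[5:9]
        return '%s%02d%s' % (year, MONTHS[month], day)

    def restore_key(key):
        # Last-pass own coding: invert the transformation.
        year, month, day = key[:4], int(key[4:6]), key[6:8]
        inv = {v: k for k, v in MONTHS.items()}
        return '%s%s%s' % (inv[month], day, year)

    # edit_key('JUL041776') == '17760704';  restore_key('17760704') == 'JUL041776'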

Sort generators also take care of system details like tape label conventions, 
and they often provide for “hash totals” or other checks to make sure that none 
of the data has been lost or altered. Sometimes there are provisions for stopping 
the sort at convenient places and resuming later. The fanciest generators allow 
records to have dynamically varying lengths [see D. J. Waks, CACM 6 (1963), 
267-272]. 

*Merging with fewer buffers. We have seen that 2P + 2 buffers are sufficient 
to keep tapes moving rapidly during a P-way merge. Let us conclude this section 
by making a mathematical analysis of the merging time when fewer than 2P + 2 
buffers are present. 

Two output buffers are clearly desirable, since we can be writing from one 
while forming the next block of output in the other. Therefore we may ignore 
the output question entirely, and concentrate only on the input. 

Suppose there are P + Q input buffers, where 1 < Q < P. We shall use the 
following approximate model of the situation, as suggested by L. J. Woodrum 
[IBM Systems J. 9 (1970), 118-144]: It takes one unit of time to read a block of 
tape. During this time there is a probability p 0 that no input buffers have been 
emptied, p\ that one has been emptied, p > 2 that two or more have been, etc. 
When completing a tape read we are in one of Q + 1 states: 


340 SORTING 


5.4.6 


State 0. Q buffers are empty; we begin to read a block into one of them from the
appropriate file, using the forecasting technique explained earlier in this section.
After one unit of time we go to state 1 with probability p₀, otherwise we remain
in state 0.

State 1. Q − 1 buffers are empty; we begin to read into one of them, forecasting
the appropriate file. After one unit of time we go to state 2 with probability p₀,
to state 1 with probability p₁, and to state 0 with probability p≥2.

. . .

State Q − 1. One buffer is empty; we begin to read into it, forecasting the
appropriate file. After one unit of time we go to state Q with probability p₀, to
state Q − 1 with probability p₁, . . . , to state 1 with probability p_{Q−1}, and to
state 0 with probability p≥Q.

State Q. All buffers are filled. Tape reading stops for an average of ρ units of
time and then we go to state Q − 1.

We start in state 0. This model of the situation corresponds to a Markov
process (see exercise 2.3.4.2-26), which can be analyzed via generating functions
in the following interesting way: Let z be an arbitrary parameter, and assume
that each time we have a chance to read from tape we make a decision to do so
with probability z, but we decide to terminate the algorithm with probability
1 − z. Now let g_Q(z) = Σ_{n≥0} a_n^(Q) zⁿ(1 − z) be the average number of times that
state Q occurs in such a process; it follows that a_n^(Q) is the average number of
times state Q occurs when exactly n blocks have been read. Then n + a_n^(Q) ρ is
the average total time for input plus computation. If we had perfect overlap, as
in the (2P + 2)-buffer algorithm, the total time would be only n units, so a_n^(Q) ρ
represents the "reading hangup" time.

Let A_ij be the probability that we go from state i to state j in this process,
for 0 ≤ i, j ≤ Q + 1, where Q + 1 is a new "stopped" state. For example, the
A-matrix takes the following forms for small Q:
A-matrix takes the following forms for small Q: 


    A = ( p≥1z   p₀z   1−z )
        (  1      0     0  )
        (  0      0     0  )

    A = ( p≥1z   p₀z    0    1−z )
        ( p≥2z   p₁z   p₀z   1−z )
        (  0      1     0     0  )
        (  0      0     0     0  )

    A = ( p≥1z   p₀z    0     0    1−z )
        ( p≥2z   p₁z   p₀z    0    1−z )
        ( p≥3z   p₂z   p₁z   p₀z   1−z )
        (  0      0     1     0     0  )
        (  0      0     0     0     0  )




Exercise 2.3.4.2-26(b) tells us that g_Q(z) = cofactor_Q0(I − A)/det(I − A). Thus
for example when Q = 1 we have

    g₁(z) = det ( 0    −p₀z   z−1 )      det ( 1−p≥1z   −p₀z   z−1 )
                ( 1      0     0  )  /       (   −1       1     0  )
                ( 0      0     1  )          (    0       0     1  )

          = p₀z/(1 − p≥1z − p₀z) = p₀z/(1 − z) = Σ_{n≥0} np₀ · zⁿ(1 − z),

so a_n^(1) = np₀. This of course was obvious a priori, since the problem is very
simple when Q = 1. A similar calculation when Q = 2 (see exercise 14) gives
the less obvious formula

    a_n^(2) = p₀²n/(1 − p₁) − p₀²(1 − p₁ⁿ)/(1 − p₁)².          (10)

In general we can show that a_n^(Q) has the form α^(Q) n + O(1) as n → ∞, where
the constant α^(Q) is not terribly difficult to calculate. (See exercise 15.) It turns
out that α^(3) = p₀³/((1 − p₁)² − p₀p₂).

The nature of merging makes it fairly reasonable to assume that ρ = 1/P
and that we have a binomial distribution

    p_k = C(P, k) (1/P)^k (1 − 1/P)^{P−k}.

For example, when P = 5 we have p₀ = .32768, p₁ = .4096, p₂ = .2048,
p₃ = .0512, p₄ = .0064, and p₅ = .00032; hence α^(1) ≈ 0.328, α^(2) ≈ 0.182, and
α^(3) ≈ 0.125. In other words, if we use 5 + 3 input buffers instead of 5 + 5, we
can expect an additional "reading hangup" time of about 0.125/5 ≈ 2.5 percent.

Of course this model is only a very rough approximation; we know that when 
Q = P there is no hangup time at all, but the model says that there is. The 
extra reading hangup time for smaller Q just about counterbalances the savings 
in overhead gained by having larger blocks, so the simple scheme with Q = P 
seems to be vindicated. 
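These constants are easy to check numerically. The sketch below (illustrative
Python, not from the text) evaluates the formulas above and also estimates the
hangup rate by direct simulation of the buffer-state chain:

    from math import comb
    import random

    def binom(P):
        return [comb(P, k) * (1/P)**k * (1 - 1/P)**(P - k)
                for k in range(P + 1)]

    def alphas(P):
        p = binom(P)
        return (p[0],                                    # alpha^(1)
                p[0]**2 / (1 - p[1]),                    # alpha^(2), from (10)
                p[0]**3 / ((1 - p[1])**2 - p[0]*p[2]))   # alpha^(3)

    def simulate(P, Q, n=10**6, seed=1):
        # Run the model directly: each unit of time one block is read and
        # k buffers are emptied with probability p_k; count how often all
        # Q extra buffers are full (state Q), forcing a reading hangup.
        p, rng = binom(P), random.Random(seed)
        state = hangups = 0
        for _ in range(n):
            k = rng.choices(range(P + 1), weights=p)[0]
            state = max(state + 1 - k, 0)
            if state == Q:
                hangups += 1
                state = Q - 1
        return hangups / n        # approaches alpha^(Q) as n grows

    # alphas(5) -> (0.328, 0.182, 0.125) to three places, matching the text.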


EXERCISES 

1. [13] Give a formula for the exact number of characters per tape, when every block 
on the tape contains n characters. Assume that the tape could hold exactly 23000000 
characters if there were no interblock gaps. 

2 . [15] Explain why the first buffer for File 2, in line 6 of Fig. 84, is completely 
blank. 

3. [20] Would Algorithm F work properly if there were only 2P − 1 input buffers
instead of 2P? If so, prove it; if not, give an example where it fails.

4 . [20] How can Algorithm F be changed so that it works also when P = 1 ? 

► 5 . [21] When equal keys are present on different files, it is necessary to be very 
careful in the forecasting process. Explain why, and show how to avoid difficulty by 
defining the merging and forecasting operations of Algorithm F more precisely. 




6. [22] What changes should be made to Algorithm 5.4.3C in order to convert it 
into an algorithm for cascade merge with rewind overlap, on T + 1 tapes? 

► 7. [26] The initial distribution in example 7 of Chart A produces

    (A₁D₁)¹¹D₁   (A₁D₁)¹⁰D₁   (A₁D₁)⁹D₁   (A₁D₁)⁷

on tapes 1-4, where (A₁D₁)⁷ means A₁D₁A₁D₁A₁D₁A₁D₁A₁D₁A₁D₁A₁D₁. Show
how to insert additional A₀'s and D₀'s in a "best possible" way (in the sense that
the overall number of initial runs processed while merging is minimized), bringing the
distribution up to

    A(DA)¹⁴   (DA)²⁸   (DA)²⁶   (DA)²².

Hint: To preserve parity it is necessary to insert many of the A₀'s and D₀'s as adjacent
pairs. The merge numbers for each initial run may be computed as in exercise 5.4.4-5;
some simplification occurs since adjacent runs always have adjacent merge numbers.

8. [20] Chart A shows that most of the schemes for initial distribution of runs (with 
the exception of the initial distribution for the cascade merge) tend to put consecutive 
runs onto different tapes. If consecutive runs went onto the same tape we could save the 
stop/start time; would it therefore be a good idea to modify the distribution algorithms
so that they switch tapes less often? 

► 9. [22] Estimate how long the read-backward polyphase algorithm would have taken 
in Chart A, if we had used all T = 6 tapes for sorting, instead of T = 5 as in example 7. 
Was it wise to avoid using the input tape? 

10. [M23] Use the analyses in Sections 5.4.2 and 5.4.3 to show that the length of 
each rewind during a standard six-tape polyphase or cascade merge is rarely more than 
about 54 percent of the file (except for the initial and final rewinds, which cover the 
entire file). 

11. [23] By modifying the appropriate entries in Table 1, estimate how long the first
nine examples of Chart A would have taken if we had a combined low-speed/high-speed
rewind. Assume that ρ = 1 when the tape is less than about one-fourth full, and that
the rewind time for fuller tapes is approximately five seconds plus the time that would
be obtained for ρ = ⅖. Change example 8 so that it uses cascade merge with copying,
since rewinding and reading forward is slower than copying in this case. [Hint: Use the
result of exercise 10.]

12. [40] Consider partitioning six tapes into three pairs of tapes, with each pair
playing the role of a single tape in a polyphase merge with T = 3. One tape of each
pair will contain blocks 1, 3, 5, . . . and the other tape will contain blocks 2, 4, 6, . . . ; in
this way we can essentially have two input tapes and two output tapes active at all
times while merging, effectively doubling the merging speed.

a) Find an appropriate way to extend Algorithm F to this situation. How many 
buffers should there be? 

b) Estimate the total running time that would be obtained if this method were used 
to sort 100,000 100-character records, considering both the read-forward and read- 
backward cases. 

13. [20] Can a five-tape oscillating sort, as defined in Algorithm 5.4.5B, be used to 
sort four reelfuls of input data, up until the time of the final merge? 

14. [M19] Derive (10).




15. [HM29] Prove that g_Q(z) = h_Q(z)/(1 − z), where h_Q(z) is a rational function of z
having no singularities inside the unit circle; hence a_n^(Q) = h_Q(1)·n + O(1) as n → ∞.
In particular, show that

    h₃(1) = det ( 0    −p₀     0     0  )         ( 1    −p₀     0     0  )
                ( 0   1−p₁   −p₀     0  )         ( 1   1−p₁   −p₀     0  )
                ( 0    −p₂   1−p₁   −p₀ )  /  det ( 1    −p₂   1−p₁   −p₀ ) .
                ( 1     0      0     0  )         ( 0     0     −1     1  )

16. [41] Carry out detailed studies of the problem of sorting 100,000 100-character 
records, drawing charts such as those in Chart A, assuming that 3, 4, or 5 tapes are 
available. 

*5.4.7. External Radix Sorting 

The previous sections have discussed the process of tape sorting by merging; 
but there is another way to sort with tapes, based on the radix sorting principle 
that was once used in mechanical card sorters (see Section 5.2.5). This method 
is sometimes called distribution sorting, column sorting, pocket sorting, digital 
sorting, separation sorting, etc.; it turns out to be essentially the opposite of 
merging! 

Suppose, for example, that we have four tapes and that there are only eight
possible keys: 0, 1, 2, 3, 4, 5, 6, 7. If the input data is on tape T1, we can begin
by transferring all even keys to T3, all odd keys to T4:

            T1                  T2           T3             T4
Given       {0,1,2,3,4,5,6,7}   —            —              —
Pass 1      —                   —            {0,2,4,6}      {1,3,5,7}

Now we rewind, and read T3 and then T4, putting {0, 1, 4, 5} on T1 and
{2, 3, 6, 7} on T2:

Pass 2      {0,4}{1,5}          {2,6}{3,7}   —              —

(The notation "{0,4}{1,5}" stands for a file that contains some records whose
keys are all 0 or 4, followed by records whose keys are all 1 or 5. Notice that T1
now contains those keys whose middle binary digit is 0.) After rewinding again
and distributing 0, 1, 2, 3 to T3 and 4, 5, 6, 7 to T4, we have

Pass 3      —                   —            {0}{1}{2}{3}   {4}{5}{6}{7}

Now we can finish up by copying T4 to the end of T3. In general, if the keys
range from 0 to 2^k − 1, we could sort the file in an analogous way using k passes,
followed by a final collection phase that copies about half of the data from one
tape to another. With six tapes we could use radix 3 representations in a similar
way, to sort keys from 0 to 3^k − 1 in k passes.
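The scheme is easy to simulate; here is a small Python sketch, with lists standing
in for tapes (an assumption of this illustration) and the final collection phase
appearing as the concatenation at the end:

    def tape_radix_sort(records, bits):
        # Toy simulation of the four-tape binary radix sort sketched
        # above; each tape is read forward, in first-in-first-out order.
        src = [list(records), []]        # the tape pair holding the data
        for bit in range(bits):          # least significant bit first
            lo = [k for t in src for k in t if (k >> bit) & 1 == 0]
            hi = [k for t in src for k in t if (k >> bit) & 1 == 1]
            src = [lo, hi]               # distributed onto the other pair
        return src[0] + src[1]           # collection: copy one tape after
                                         # the other

    # tape_radix_sort([3, 1, 4, 1, 5, 9, 2, 6], 4) -> [1, 1, 2, 3, 4, 5, 6, 9]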

Partial-pass methods can also be used. For example, suppose that there
are ten possible keys {0, 1, . . . , 9}, and consider the following procedure due to




R. L. Ashenhurst [Theory of Switching, Progress Report BL-7 (Harvard Univ.
Comp. Laboratory: May 1954), I.1-I.76]:


Phase   T1                      T2           T3               T4               passes
  1     {0,1,...,9}             {0,2,4,7}    {1,5,6}          {3,8,9}           1.0
  2     {0}                     —            {1,5,6}{2,7}     {3,8,9}{4}        0.4
  3     {0}{1}{2}               {6}{7}       —                {3,8,9}{4}{5}     0.5
  4     {0}{1}{2}{3}            {6}{7}{8}    {9}              {4}{5}            0.3
  C     {0}{1}{2}{3}{4}...{9}                                                   0.6
                                                                         total  2.8

Here C represents the collection phase. If each key value occurs about one-tenth 
of the time, the procedure above takes only 2.8 passes to sort ten keys, while the 
first example required 3.5 passes to sort only eight keys. Therefore we find that 
a clever distribution pattern can make a significant difference, for radix sorting 
as well as for merging. 

The distribution patterns in the examples above can conveniently be repre- 
sented as tree structures: 



The circular internal nodes of these trees are numbered 1, 2, 3, ... , corresponding 
to steps 1, 2, 3, . . . of the process. Tape names A, B, C, D (instead of Tl, T2, 
T3, T4) have been placed next to the lines of the trees, in order to show where 
the records go. Square external nodes represent portions of a file that contain 
only one key, and that key is shown in boldface type just below the node. The 
lines just above square nodes all carry the name of the output tape ( C in the 
first example, A in the second). 

Thus, step 3 of example 1 consists of reading from tape D and writing 1s
and 5s on tape A, 3s and 7s on tape B. It is not difficult to see that the number
of passes performed is equal to the external path length of the tree divided by
the number of external nodes, if we assume that each key occurs equally often.




Because of the sequential nature of tape, and the first-in-first-out discipline 
of forwards reading, we can’t simply use any labeled tree as the basis of a 
distribution pattern. In the tree of example 1, data gets written on tape A 
during step 2 and step 3; it is necessary to use the data written during step 2 
before we use the data written during step 3. In general if we write onto a tape 
during steps i and j, where i < j, we must use the data written during step i 
first; when the tree contains two branches of the form 




         i           j
         |A          |A            (i < j)
         k           l


we must have k < l. Furthermore we cannot write anything onto tape A between 
steps k and Z, because we must rewind between reading and writing. 

The reader who has worked the exercises of Section 5.4.4 will now immedi- 
ately perceive that the allowable trees for read-forward radix sorting on T tapes 
are precisely the strongly T-fifo trees , which characterize read-forward merge 
sorting on T tapes! (See exercise 5.4.4-20.) The only difference is that all of 
the external nodes on the trees we are considering here have the same tape 
labels. We could remove this restriction by assuming a final collection phase 
that transfers all records to an output tape, or we could add that restriction to 
the rules for T-fifo trees by requiring that the initial distribution pass of a merge 
sort be explicitly represented in the corresponding merge tree. 

In other words, every merge pattern corresponds to a distribution pattern, 
and every distribution pattern corresponds to a merge pattern. A moment’s 
reflection shows why this is so, if we consider the actions of a merge sort and 
imagine that time could run backwards: The final output is “unmerged” into 
subfiles, which are unmerged into others, etc.; at time zero the output has been 
unmerged into S runs. Such a pattern is possible with tapes if and only if 
the corresponding radix sort distribution pattern, for S keys, is possible. This 
duality between merging and distribution is almost perfect; it breaks down only 
in one respect, namely that the input tape must be saved at different times. 

The eight-key example treated at the beginning of this section is clearly 
dual to a balanced merge on four tapes. The ten-key example with partial 
passes corresponds to the following ten-run merge pattern (if we suppress the 
copy phases, steps 6-11 in the tree): 



                        T1     T2     T3      T4
Initial distribution    1⁴     1³     1¹      1²
Tree step 5             1³     1²     —       1²3¹
Tree step 4             1²     1¹     2¹      1²3¹
Tree step 3             1¹     —      2¹3¹    1¹3¹
Tree step 2             —      4¹     3¹      3¹
Tree step 1             10¹    —      —       —

If we compare this to the radix sort, we see that the methods have the same
structure but are reversed in time, with the tape contents also reversed from
back to front: 1²3¹ (two runs each of length 1 followed by one of length 3)
corresponds to {3,8,9}{4}{5} (two subfiles containing one key each, preceded
by one subfile containing three).

Going the other way, we can in principle construct a radix sort dual to 
polyphase merge, another one dual to cascade merge, etc. For example, the 
21-run polyphase merge on three tapes, illustrated at the beginning of Section 
5.4.2, corresponds to the following interesting radix sort: 


Phase  T1                                           T2                                            T3
Given  {0,1,...,20}                                 —                                             —
  1    —                                            {0,2,4,5,7,9,10,12,13,15,17,18,20}            {1,3,6,8,11,14,16,19}
  2    {0,5,10,13,18}                               —                                             {1,3,6,8,11,14,16,19}{2,4,7,9,12,15,17,20}
  3    {0,5,10,13,18}{1,6,11,14,19}{2,7,12,15,20}   {3,8,16}{4,9,17}                              —
  4    —                                            {3,8,16}{4,9,17}{5,10,18}{6,11,19}{7,12,20}   {0,13}{1,14}{2,15}
  5    {8}{9}{10}{11}{12}                           —                                             {0,13}{1,14}{2,15}{3,16}...{7,20}
  6    {8}{9}{10}{11}{12}{13}...{20}                {0}{1}...{7}                                  —


The distribution rule used here to decide which keys go on which tapes at 
each step appears to be magic, but in fact it has a simple connection with the 
Fibonacci number system. (See exercise 2.) 


Reading backwards. Duality between radix sorting and merging applies also 
to algorithms that read tapes backwards. We have defined “T-lifo trees” in 
Section 5.4.4, and it is easy to see that they correspond to radix sorts as well as 
to merge sorts. 

A read-backward radix sort was actually considered by John Mauchly al- 
ready in 1946, in one of the first papers ever to be published about sorting 
(see Section 5.5); Mauchly essentially gave the following construction: 


Phase   T1                   T2               T3                T4
Given   —                    {0,1,2,...,9}    —                 —
  1     {4,5}                —                {2,3,6,7}         {0,1,8,9}
  2     {4,5}{2,7}           {3,6}            —                 {0,1,8,9}
  3     {4,5}{2,7}{0,9}      {3,6}{1,8}       —                 —
  4     {4,5}{2,7}           {3,6}{1,8}       {9}               {0}
  .         .                    .                .                 .
  8     —                    —                {9}{8}{7}{6}{5}   {0}{1}{2}{3}{4}
  C     —                    —                —                 {0}{1}{2}{3}{4}{5}...{9}


His scheme is not the most efficient one possible, but it is interesting because 
it shows that partial pass methods were considered for radix sorting already in 
1946, although they did not appear in the literature for merging until about 1960. 

An efficient construction of read-backward distribution patterns has been
suggested by A. Bayes [CACM 11 (1968), 491-493]: Given P + 1 tapes and
S keys, divide the keys into P subfiles each containing ⌊S/P⌋ or ⌈S/P⌉ keys,
and apply this procedure recursively to each subfile. When S < 2P, one subfile
should consist of the smallest key alone, and it should be written onto the output
file. (R. M. Karp's general preorder construction, which appears at the end of
Section 5.4.4, includes this method as a special case.)
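Under one reading of that rule, the pattern can be sketched recursively as
follows (illustrative Python only; it returns, for each key, the depth at which the
key reaches the output, and the depth profile determines the partial-pass cost):

    def bayes_tree(S, P, depth=0, out=None):
        # Bayes's construction as a recursion on subfile sizes: a subfile
        # of S keys is split into P nearly equal parts; small subfiles
        # peel the smallest key off to the output file.  (A sketch; the
        # real pattern also assigns tapes and read-backward directions.)
        if out is None:
            out = []
        if S == 0:
            return out
        if S < 2 * P:                   # smallest key goes straight out
            out.append(depth)
            S -= 1
        q, r = divmod(S, P)
        for i in range(P):              # P subfiles of floor/ceil(S/P) keys
            size = q + (1 if i < r else 0)
            if size:
                bayes_tree(size, P, depth + 1, out)
        return out

    # sum(bayes_tree(10, 3)) / 10.0 estimates the passes per key for
    # S = 10 keys on P + 1 = 4 tapes, i.e. external path length divided
    # by the number of external nodes, as in the text.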

Backward reading makes merging a little more complicated because it re- 
verses the order of runs. There is a corresponding effect on radix sorting: The 
outcome is stable or “anti-stable” depending on what level is reached in the tree. 
After a read-backward radix sort in which some of the external nodes are at odd 
levels and some are at even levels, the relative order of different records with 
equal keys will be the same as the original order for some keys, but it will be 
the opposite of the original order for the other keys. (See exercise 6.) 

Oscillating merge sorts have their counterparts too, under duality. In an 
oscillating radix sort we continue to separate out the keys until reaching subfiles 
that have only one key or are small enough to be internally sorted; such subfiles 
are sorted and written onto the output tape, then the separation process is 
resumed. For example, if we have three work tapes and one output tape, and if 
the keys are binary numbers, we may start by putting keys of the form 0x on tape
T1, keys 1x on T2. If T1 receives more than one memory load, we scan it again
and put 00x on T2 and 01x on T3. Now if the 00x subfile is short enough to be
internally sorted, we do so and output the result, then continue by processing
the 01x subfile. Such a method was called a "cascading pseudo-radix sort" by
E. H. Friend [JACM 3 (1956), 157-159]; it was developed further by H. Nagler 
[JACM 6 (1959), 459-468], who gave it the colorful name “amphisbaenic sort,” 
and by C. H. Gaudette [IBM Tech. Disclosure Bull. 12 (April 1970), 1849-1853]. 
Does radix sorting beat merging? One important consequence of the duality 
principle is that radix sorting is usually inferior to merge sorting. This happens 
because the technique of replacement selection gives merge sorting a definite 
advantage; there is no apparent way to arrange radix sorts so that we can make 
use of internal sorts encompassing more than one memory load at a time. Indeed, 
the oscillating radix sort will often produce subfiles that are somewhat smaller 
than one memory load, so the distribution pattern will correspond to a tree with 
many more external nodes than would be present if merging and replacement 
selection were used. Consequently the external path length of the tree — the 
sorting time — will be increased. (See exercise 5.3.1-33.) 

On the other hand, external radix sorting does have its uses. Suppose, 
for example, that we have a file containing the names of all employees of a 
large corporation, in alphabetic order; the corporation has 10 divisions, and 
it is desired to sort the file by division, retaining the alphabetic order of the 
employees in each division. This is a perfect situation in which to apply a stable 
radix sort, if the file is long, since the number of records that belong to each 
of the 10 divisions is likely to be more than the number of records that would 
be obtained in initial runs produced by replacement selection. In general, if the 
range of key values is so small that the collection of records having a given key 
is expected to fill the internal memory more than twice, it is wise to use a radix 
sort technique. 


348 SORTING 


5.4.7 


We have seen in Section 5.2.5 that internal radix sorting is superior to 
merging, on certain high-speed computers, because the inner loop of the radix 
sort algorithm avoids complicated branching. If the external memory is especially 
fast, it may be impossible for such machines to merge data rapidly enough to 
keep up with the input/output equipment. Radix sorting may therefore turn out 
to be superior to merging in such a situation, especially if the keys are known to 
be uniformly distributed. 

EXERCISES 

1. [20] The general T-tape balanced merge with parameter P, 1 ≤ P < T, was
defined near the beginning of Section 5.4. Show that this corresponds to a radix sort
based on a mixed-radix number system.

2. [M28] The text illustrates the three-tape polyphase radix sort for 21 keys. Gener- 
alize to the case of F n keys; explain what keys appear on what tapes at the end of each 
phase. [Hint: Consider the Fibonacci number system, exercise 1.2.8-34.] 

3. [M35] Extend the results of exercise 2 to the polyphase radix sort on four or more 
tapes. (See exercise 5.4.2-10.) 

4. [M23] Prove that Ashenhurst's distribution pattern is the best way to sort 10
keys on four tapes without reading backwards, in the sense that the associated tree has
minimum external path length over all strongly 4-fifo trees. (Thus, it is essentially the
best method if we ignore rewind time.)

5. [15] Draw the 4-lifo tree corresponding to Mauchly’s read-backwards radix sort 
for 10 keys. 

► 6. [20] A certain file contains two-digit keys 00, 01, . . . , 99. After performing
Mauchly's radix sort on the least significant digits, we can repeat the same scheme
on the most significant digits, interchanging the roles of tapes T2 and T4. In what
order will the keys finally appear on T2?

7. [21] Does the duality principle apply also to multireel files?

*5.4.8. Two-Tape Sorting 

Since we need three tapes to carry out a merge process without excessive tape 
motion, it is interesting to speculate about how we could perform a reasonable 
external sort using only two tapes. 

One approach, suggested by H. B. Demuth in 1956, is sort of a combined 
replacement-selection and bubble sort. Assume that the input is on tape Tl, 
and begin by reading P + 1 records into memory. Now output the record whose 
key is smallest, to tape T2, and replace it by the next input record. Continue 
outputting a record whose key is currently the smallest in memory, maintaining 
a selection tree or a priority queue of P + 1 elements. When the input is finally 
exhausted, the largest P keys of the file will be present in memory; output them 
in ascending order. Now rewind both tapes and repeat the process by reading 
from T2 and writing to Tl; each such pass puts at least P more records into 
their proper place. A simple test can be built into the program that determines 
when the entire file is in sort. At most ⌈(N − 1)/P⌉ passes will be necessary.
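One pass of this scheme is easy to express with a priority queue. A Python
sketch follows; the callbacks read_record (returning None at end of tape) and
write_record are assumptions of this illustration:

    import heapq

    def demuth_pass(read_record, write_record, P):
        # One pass of the order-P "bubble sort": keep P + 1 records in a
        # priority queue, always emitting the smallest.  Each pass removes
        # up to P inversions from every record's count.
        heap = []
        rec = read_record()
        while rec is not None and len(heap) < P + 1:
            heapq.heappush(heap, rec)
            rec = read_record()
        while rec is not None:
            write_record(heapq.heappushpop(heap, rec))
            rec = read_record()
        while heap:                # input exhausted: flush the largest
            write_record(heapq.heappop(heap))    # P + 1 keys, in order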


5.4.8 


TWO-TAPE SORTING 349 


A few moments’ reflection shows that each pass of this procedure is essen- 
tially equivalent to P consecutive passes of the bubble sort (Algorithm 5.2.2B). If 
an element has P or more inversions, it will be smaller than everything in the tree 
when it is input, so it will be output immediately — thereby losing P inversions. 
If an element has fewer than P inversions, it will go into the selection tree and 
will be output before all greater keys — thereby losing all its inversions. When 
P = 1, this is exactly what happens in the bubble sort, by Theorem 5.2.2I.

The total number of passes will therefore be ⌈I/P⌉, where I is the maximum
number of inversions of any element. By the theory developed in Section 5.2.2,
the average value of I is N − √(πN/2) + 2/3 + O(1/√N).

If the file is not too much larger than the memory size, or if it is nearly in 
order to begin with, this order-P bubble sort will be fairly rapid; in fact, such a 
method might be advantageous even when extra tape units are available, because 
scratch tapes must be mounted by a human operator. But a two-tape bubble 
sort will run quite slowly on fairly long, randomly ordered files, since its average 
running time will be approximately proportional to N 2 . 

Let us consider how this method might be implemented for the 100,000- 
record example of Section 5.4.6. We need to choose P intelligently, in order to 
compensate for interblock gaps while doing simultaneous reading, writing, and 
computing. Since the example assumes that each record is 100 characters long 
and that 100,000 characters will fit into memory, we can make room for two 
input buffers and two output buffers of size B by setting 

100(P + 1) + 4B = 100000.    (1)

Using the notation of Section 5.4.6, the running time for each pass will be about

NCωτ(1 + ρ),    ω = (B + γ)/B.    (2)

Since the number of passes is inversely proportional to P, we want to choose B to
be a multiple of 100 that minimizes the quantity ω/P. Elementary calculus shows
that this occurs when B is approximately √(24975γ + γ²) − γ, so we take B =
3000, P = 879. Setting N = 100000 in the formulas above shows that the number
of passes ⌈I/P⌉ will be about 114, and the total estimated running time will be
approximately 8.57 hours (assuming for convenience that the initial input and
the final output also have B = 3000). This represents approximately 0.44 reelfuls
of data; a full reel would take about five times as long. Some improvements could
be made if the algorithm were interrupted periodically, writing the records with
largest keys onto an auxiliary tape that is dismounted, since such records are
merely copied back and forth once they have been put into order.
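The optimization is easy to reproduce numerically. In the Python sketch below,
the per-block overhead γ = 480 character times is an assumption inferred from
the numbers in the text, and the pass count uses the average value of I given
above:

    import math

    def tune_order_p_bubble(N=100000, mem=100000, rec=100, gamma=480):
        # Choose the block size B (a multiple of rec) minimizing omega/P,
        # where omega = (B + gamma)/B and 100(P+1) + 4B = mem, as in
        # (1) and (2).
        best = None
        for B in range(rec, mem // 4, rec):
            P = (mem - 4 * B) // rec - 1
            if P < 1:
                break
            omega = (B + gamma) / B
            if best is None or omega / P < best[0]:
                best = (omega / P, B, P)
        _, B, P = best
        passes = math.ceil((N - math.sqrt(math.pi * N / 2)) / P)
        return B, P, passes

    # tune_order_p_bubble() -> (3000, 879, 114), matching the text.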

Application of quicksort. Another internal sorting method that traverses 
the data in a nearly sequential manner is the partition exchange or quicksort 
procedure, Algorithm 5.2.2Q. Can we adapt it to two tapes? [N. B. Yoash, 
CACM 8 (1965), 649.] 

It is not difficult to see how this can indeed be done, using backward reading. 
Assume that the two tapes are numbered 0 and 1, and imagine that the file is 


350 SORTING 


5.4.8 


laid out as follows: 

[Diagram: Tape 0 and Tape 1 laid end to end. Tape 0 runs from its beginning
("bottom"), at the left, to its current position ("top") in the middle; Tape 1's
current position ("top") adjoins it, with Tape 1's beginning ("bottom") at the
right.]


Each tape serves as a stack; putting them together like this makes it possible to 
view the file as a linear list in which we can move the current position left or 
right by copying from one stack to the other. The following recursive subroutines 
define a suitable sorting procedure: 

• SORT00 [Sort the top subfile on tape 0 and return it to tape 0].

If the subfile fits in the internal memory, sort it internally and return it to tape.
Otherwise select one record R from the subfile, and let its key be K. Reading
backwards on tape 0, copy all records whose key is > K, forming a new subfile
on the top of tape 1. Now read forward on tape 0, copying all records whose key
is = K onto tape 1. Then read backwards again, copying all records whose key is
< K onto tape 1. Complete the sort by executing SORT10 on the < K keys, then
copying the = K keys to tape 0, and finally executing SORT10 on the > K keys.

• SORT01 [Sort the top subfile on tape 0 and write it on tape 1].

Same as SORT00, but the final "SORT10" is changed to "SORT11", followed by
copying the < K keys to tape 1.

• SORT10 [Sort the top subfile on tape 1 and write it on tape 0].

Same as SORT01, interchanging 0 with 1 and < with >.

• SORT11 [Sort the top subfile on tape 1 and return it to tape 1].

Same as SORT00, interchanging 0 with 1 and < with >.

The recursive nature of these subroutines can be handled without difficulty by 
storing appropriate control information on the tapes. 

The running time for this algorithm can be estimated as follows, if we assume
that the data are in random order, with negligible probability of equal keys. Let
M be the number of records that fit into internal memory. Let X_N be the
average number of records read while applying SORT00 or SORT11 to a subfile of
N records, when N > M, and let Y_N be the corresponding quantity for SORT01
or SORT10. Then we have

    X_N = 0,  if N ≤ M;    X_N = 3N + 1 + (1/N) Σ_{0≤k<N} (Y_k + Y_{N−1−k}),      if N > M;
                                                                                       (3)
    Y_N = 0,  if N ≤ M;    Y_N = 3N + 2 + (1/N) Σ_{0≤k<N} (Y_k + X_{N−1−k} + k),  if N > M.

The solution to these recurrences (see exercise 2) shows that the total amount of
tape reading during the external partitioning phases will be 6⅔ N ln N + O(N),
on the average, as N → ∞. We also know from Eq. 5.2.2-(25) that the average
number of internal sort phases will be 2(N + 1)/(M + 2) − 1.



5.4.8 


TWO-TAPE SORTING 351 


If we apply this analysis to the 100,000-record example of Section 5.4.6,
using 25,000-character buffers and assuming that the sorting time is 2nCωτ
for a subfile of n ≤ M = 1000 records, we obtain an average sorting time of
approximately 103 minutes (including the final rewind as in Chart A). Thus the
quicksort method isn't bad, on the average; but of course its worst case turns
out to be even more awful than the bubble sort discussed above. Randomization
will make the worst case extremely unlikely.

Radix sorting. The radix exchange method (Algorithm 5.2.2R) can be adapted 
to two-tape sorting in a similar way, since it is so much like quicksort. The trick 
that makes both of these methods work is the idea of reading a file more than 
once, something we never did in our previous tape algorithms. 

The same trick can be used to do a conventional least-significant-digit-first
radix sort on two tapes. Given the input data on T1, we copy onto T2 all records
whose key ends with 0 in binary notation; then after rewinding T1 we read it
again, copying the records whose key ends with 1. Now both tapes are rewound
and a similar pair of passes is made, interchanging the roles of T1 and T2, and
using the second least significant binary digit. At this point T1 will contain all
records whose keys are (. . . 00)₂, followed by those whose keys are (. . . 01)₂, then
(. . . 10)₂, then (. . . 11)₂. If the keys are b bits long, we need only 2b passes over
the file in order to complete the sort.

Such a radix sort could be applied only to the leading b bits of the keys, for
some judiciously chosen number b; that would reduce the number of inversions
by a factor of about 2^b, if the keys were uniformly distributed, so a few passes of
the P-way bubble sort could then be used to complete the job. This approach
reads tape in the forward direction only.
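In simulated form (Python lists standing in for the two tapes, an assumption
of this sketch), the 2b-pass scheme is:

    def two_tape_lsd_sort(records, bits, key=lambda r: r):
        # Two-tape least-significant-digit radix sort: 2b passes for
        # b-bit keys, reading forward only.
        t1 = list(records)
        for bit in range(bits):
            # two passes over the source tape: first the 0s, then the 1s
            t2 = [r for r in t1 if not (key(r) >> bit) & 1]
            t2 += [r for r in t1 if (key(r) >> bit) & 1]
            t1 = t2                  # the roles of T1 and T2 interchange
        return t1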

A novel but somewhat more complicated approach to two-tape distribution
sorting has been suggested by A. I. Nikitin and L. I. Sholmov [Kibernetika 2, 6
(1966), 79-84]. Counts are made of the number of keys having each possible
configuration of leading bits, and artificial keys κ₁, κ₂, . . . , κ_M based on these
counts are constructed so that the number of actual keys lying between κ_i and
κ_{i+1} is between predetermined limits P₁ and P₂, for each i. Thus, M lies between
⌈N/P₂⌉ and ⌈N/P₁⌉. If the leading bit counts do not give sufficient information
to determine such κ₁, κ₂, . . . , κ_M, one or more further passes are made to count
the frequency of less significant bit patterns, for certain configurations of most
significant bits. After the table of artificial keys κ₁, κ₂, . . . , κ_M has been con-
structed, 2⌈lg M⌉ further passes will suffice to complete the sort. (This method
requires memory space proportional to N, so it can't be used for external sorting
as N → ∞. In practice we would not use the technique for multireel files, so M
will be comparatively small and the table of artificial keys will fit comfortably
in memory.)

Simulation of more tapes. F. C. Hennie and R. E. Stearns have devised a 
general technique for simulating k tapes on only two tapes, in such a way that 
the tape motion required is increased by a factor of only O(log L), where L is the
maximum distance to be traveled on any one tape [JACM 13 (1966), 533-546].


352 SORTING 


5.4.8 


[Fig. 86 diagram: blocks 1, 5, 9, 13, . . . of the tape form Track 1, blocks
2, 6, 10, 14, . . . form Track 2, and so on through Track 4; each track is divided
into Zones 0, 1, 2, 3 of 1, 2, 4, 8 blocks.]

Fig. 86. Layout of tape T1 in the Hennie-Stearns construction; nonblank zones are
shaded.

Their construction can be simplified slightly in the case of sorting, as in the 
following method suggested by R. M. Karp. 

We shall simulate an ordinary four-tape balanced merge, using two tapes T1 
and T2. The first of these, Tl, holds the simulated tape contents in a way that 
may be diagrammed as in Fig. 86; we imagine that the data is written in four 
“tracks,” one for each simulated tape. (In actual fact the tape doesn’t have such 
tracks; blocks 1, 5, 9, 13, ... are thought of as Track 1, blocks 2, 6, 10, 14, ... 
as Track 2, etc.) The other tape, T2, is used only for auxiliary storage, to help 
move things around on Tl. 

The blocks of each track are divided into zones, containing, respectively,
1, 2, 4, 8, . . . , 2^k, . . . blocks per zone. Zone k on each track is either filled with
exactly 2^k blocks of data, or it is completely blank. In Fig. 86, for example,
Track 1 has data in zones 1 and 3; Track 2 in zones 0, 1, 2; Track 3 in zones 0
and 2; Track 4 in zone 1; and the other zones are blank.

Suppose that we are merging data from Tracks 1 and 2 to Track 3. The
internal computer memory contains two buffers used for input to a two-way
merge, plus a third buffer for output. When the input buffer for Track 1 becomes
empty, we can refill it as follows: Find the first nonempty zone on Track 1, say
zone k, and copy its first block into the input buffer; then copy the other 2^k − 1
blocks of data onto T2, and move them to zones 0, 1, . . . , k − 1 of Track 1. (Zones
0, 1, . . . , k − 1 are now full and zone k is blank.) An analogous procedure is used
to refill the input buffer for Track 2, whenever it becomes empty. When the
output buffer is ready to be written on Track 3, we reverse the process, scanning
across T1 to find the first blank zone on Track 3, say zone k, while copying the
data from zones 0, 1, . . . , k − 1 onto T2. The data on T2, augmented by the
contents of the output buffer, is now used to fill zone k of Track 3.
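The refill step is the heart of the construction. A small Python sketch of it
follows (lists model the zones of one simulated track and the auxiliary tape T2;
this is an idealization, since a real tape permits only sequential motion):

    def refill(track, aux):
        # Refill an input buffer from a simulated track: find the first
        # nonempty zone k, take its first block, and redistribute the
        # other 2^k - 1 blocks into zones 0 .. k-1 via the auxiliary
        # tape.  track is a list of zones; zone k is either a list of
        # 2^k blocks or None (blank).
        k = next(i for i, z in enumerate(track) if z is not None)
        blocks = track[k]
        track[k] = None
        head, rest = blocks[0], blocks[1:]
        aux.extend(rest)                  # copied onto tape T2 ...
        for i in range(k):                # ... and moved to zones 0..k-1
            track[i] = [aux.pop(0) for _ in range(2 ** i)]
        return head                       # the block for the input buffer

Writing into zone k of an output track is the same process run in reverse, as the
text describes.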

This procedure requires the ability to write in the middle of tape Tl, without 
destroying subsequent information on that tape. As in the case of read-forward 
oscillating sort (Section 5.4.5), it is possible to do this reliably if suitable pre- 
cautions are taken. 

The amount of tape motion required to bring 2^l − 1 blocks of Track 1 into
memory is Σ_{0≤k<l} 2^{l−1−k} · c·2^k = c·l·2^{l−1}, for some constant c, since we scan up
to zone k only once in every 2^k steps. Thus each merge pass requires O(N log N)
steps. Since there are O(log N) passes in a balanced merge, the total time to




sort is guaranteed to be O(N(log N)²) in the worst case; this is asymptotically
much better than the worst case of quicksort.

But this method wouldn’t work very well if we applied it to the 100,000- 
record example of Section 5.4.6, since the information specified for tape T1 would 
overflow the contents of one tape reel. Even if we ignore this fact, and if we use 
optimistic assumptions about read/write/compute overlap and interblock gap 
lengths, etc., we find that roughly 37 hours would be required to complete the 
sort! So this method is purely of academic interest; the constant in O(N(log N)²)
is much too high to be satisfactory when N is in a practical range.

One-tape sorting. Could we live with only one tape? It is not difficult to see 
that the order-P bubble sort described above could be converted into a one-tape 
sort, but the result would be ghastly. 

H. B. Demuth [Ph.D. thesis (Stanford University, 1956), 85] observed that a
computer with bounded internal memory cannot reduce the number of inversions
of a permutation by more than a bounded amount as it moves a bounded distance
on tape; hence every one-tape sorting algorithm must take at least N²d units of
time on the average, for some positive constant d that depends on the computer
configuration.

R. M. Karp has pursued this topic in a very interesting way, discovering an 
essentially optimum way to sort with one tape. It is convenient to discuss Karp’s 
algorithm by reformulating the problem as follows: What is the fastest way 
to transport people between floors using a single elevator? [See Combinatorial 
Algorithms, edited by Randall Rustin (Algorithmics Press, 1972), 17-21.] 

Consider a building with n floors, having room for exactly b people on each 
floor. The building contains no doors, windows, or stairs, but it does have an 
elevator that can stop on each floor. There are bn people in the building, and 
exactly b of them want to be on each particular floor. The elevator holds at most
m people, and it takes one unit of time to go from floor i to floor i ± 1. We
wish to find the quickest way to get all the people onto the proper floors, if the 
elevator is required to start and finish on floor 1. 

The connection between this elevator problem and one-tape sorting is not 
hard to see: The people are the records and the building is the tape. The floors 
are individual blocks on the tape, and the elevator is the internal computer 
memory. A computer program has more flexibility than an elevator operator 
(it can, for example, duplicate people, or temporarily chop them into two parts 
on different floors, etc.); but the solution below solves the problem in the fastest 
conceivable time without doing such operations. 

The following two auxiliary tables are required by Karp's algorithm:

    u_k, 1 ≤ k ≤ n:  the number of people on floors ≤ k whose destination is > k;
                                                                                   (4)
    d_k, 1 ≤ k ≤ n:  the number of people on floors ≥ k whose destination is < k.

When the elevator is empty, we always have u_k = d_{k+1} for 1 ≤ k < n, since there
are b people on every floor; the number of misfits on floors {1, . . . , k} must equal
the corresponding number on floors {k+1, . . . , n}. By definition, u_n = d₁ = 0.


354 


SORTING 


5.4.8 



Fig. 87. Karp’s elevator algorithm. 

It is clear that the elevator must make at least ⌈u_k/m⌉ trips from floor k
to floor k + 1, for 1 ≤ k < n, since only m passengers can ascend on each trip.
Similarly it must make at least ⌈d_k/m⌉ trips from floor k to floor k − 1. Therefore
the elevator must necessarily take at least

    Σ_{k=1}^{n} (⌈u_k/m⌉ + ⌈d_k/m⌉)                                    (5)

units of time on any correct schedule. Karp discovered that this lower bound
can actually be achieved, when u₁, . . . , u_{n−1} are nonzero.

Theorem K. If u_k > 0 for 1 ≤ k < n, there is an elevator schedule that delivers
everyone to the correct floor in the minimum time (5).

Proof. Assume that there are m extra people in the building; they start in
the elevator and their destination floor is artificially set to 0. The elevator can
operate according to the following algorithm, starting with k (the current floor)
equal to 1:

K1. [Move up.] From among the b + m people currently in the elevator or on
floor k, those m with the highest destinations get into the elevator, and the
others remain on floor k.

Let there be u people now in the elevator whose destination is > k,
and d whose destination is ≤ k. (It will turn out that u = min(m, u_k);
if u_k < m we may therefore be transporting some people away from their
destination. This represents their sacrifice to the common good.) Decrease
u_k by u, increase d_{k+1} by d, and then increase k by 1.

K2. [Still going up?] If u > 0, return to step K1.

K3. [Move down.] From among the b + m people currently in the elevator or on
floor k, those m with the lowest destinations get into the elevator, and the
others remain on floor k.

Let there be u people now in the elevator whose destination is ≥ k, and
d whose destination is < k. (It will always turn out that u = 0 and d = m,
but the algorithm is described here in terms of general u and d in order to
make the proof a little clearer.) Decrease d_k by d, increase u_{k−1} by u, and
then decrease k by 1.








[Fig. 88 diagram: a nine-floor building with two people per floor; at each step
the people on floors 1-9 and the three in the elevator are shown by their
destination-floor digits, tracing the elevator's path from "Begin" at floor 1 up
the building and back down to "End".]

Fig. 88. An optimum way to rearrange people using a small, slow elevator. (People
are each represented by the number of their destination floor.)


K4. [Still going down?] If k > 1 and u_{k−1} > 0, return to step K3. If k = 1
and u₁ = 0, terminate the algorithm (everyone has arrived safely and the
m "extras" are back in the elevator). Otherwise return to step K2.

Figure 88 shows an example of this algorithm, with a nine-floor building and
b = 2, m = 3. Note that one of the 6s is temporarily transported to floor 7, in
spite of the fact that the elevator travels the minimum possible distance. The
idea of testing u_{k−1} in step K4 is the crux of the algorithm, as we shall see.

To verify the validity of this algorithm, we note that steps K1 and K3 always 
keep the u and d tables ( 4 ) up to date, if we regard the people in the elevator as 
being on the “current” floor k. It is now possible to prove by induction that the 
following properties hold at the beginning of each step: 

ui = dj+i, for k <1 < n; ( 6 ) 

ui — di + 1 — m, for 1 < l < k; ( 7 ) 

u ; + 1 =0, if ui — 0 and k < l < n. (8) 

Furthermore, at the beginning of step Kl, the min (v,k, m) people with highest 

destinations, among all people on floors < k with destination > k, are in the 
elevator or on floor k. At the beginning of step K3, the min (d^, rn) people with 
lowest destinations, among all people on floors > k with destination < k, are in 
the elevator or on floor k. 

From these properties it follows that the parenthesized remarks in steps Kl 
and K3 are valid. Each execution of step Kl therefore decreases [itfc/m] by 1 
and leaves [rffc+i/m] unchanged; each execution of K3 decreases \dk/rn\ by 1 
and leaves \uk-\/rn\ unchanged. The algorithm must therefore terminate in a 
finite number of steps, and everybody must then be on the correct floor because 
of (6) and ( 8 ). | 


356 SORTING 


5.4.8 


When u k = 0 and u k+ \ > 0 we have a “disconnected” situation; the elevator 
must journey up to floor k + 1 in order to rearrange the people up there, even 
though nobody wants to move from floors < k to floors > k + 1. Without loss 
of generality, we may assume that u„_! > 0; then every valid elevator schedule 
must include at least 

2 ^ max(l, fttfc/m]) ' ( 9 ) 

l<fc<n 

moves, since we require the elevator to return to floor 1. A schedule achieving 
this lower bound is readily constructed (exercise 4). 

EXERCISES 

1. [20] The order-P bubble sort discussed in the text uses only forward reading and 
rewinding. Can the algorithm be modified to take advantage of backward reading? 

2. [ M26 ] Find explicit closed-form solutions for the numbers X N , Y N defined in (3). 
[Hint: Study the solution to Eq. 5.2.2-(ig).] 

3. [38] Is there a two-tape sorting method, based only on comparisons of keys (not 
digital properties), whose tape motion is 0(N log N) in the worst case, when sorting 
N records? [Quicksort achieves this on the average, but not in the worst case, and the 
Hennie-Stearns method (Fig. 86) achieves 0(N(logN) 2 ).] 

4. [ M23 ] In the elevator problem, suppose there are indices p and q, with q > p + 2, 
Up > 0, u q > 0, and u p +i = • • • = u q - 1 = 0. Explain how to construct a schedule 
requiring at most (9) units of time. 

► 5. [M23] True or false: After step K1 of the algorithm in Theorem K, nobody on 
the elevator has a lower destination than any person on floors < k. 

6. [M30] (R. M. Karp.) Generalize the elevator problem (Fig. 88) to the case that 
there are bj passengers initially on floor j, and b'j passengers whose destination is floor j, 
for 1 < j < n. Show that a schedule exists that takes 2]T”I 1 1 max(l, \u k /m ] , \d k+1 /m] ) 
units of time, never allowing more than max(h,,6j) passengers to be on floor j at any 
one time. [Hint: Introduce fictitious people, if necessary, to make bj = 6) for all j] 

7. [M40] (R. M. Karp.) Generalize the problem of exercise 6, replacing the linear 
path of an elevator by a network of roads to be traveled by a bus, given that the network 
forms any free tree. The bus has finite capacity, and the goal is to transport passengers 
to their destinations in such a way that the bus travels a minimum distance. 

8. [ M32 ] Let b = 1 in the elevator problem treated in the text. How many permu- 
tations of the n people on the n floors will make u k < 1 for 1 < k < n in (4)? [For 
example, 314592687 is such a permutation.] 

► 9. [ M25 ] Find a significant connection between the “cocktail-shaker sort” described 
in Section 5.2.2, Fig. 16, and the numbers u\, U 2 , . . . , u n of (4) in the case 6=1. 

10. [20] How would you sort a multireel file with only two tapes? 

*5.4.9. Disks and Drums 

So far we have considered tapes as the vehicles for external sorting, but more 
flexible types of mass storage devices are generally available. Although such 
bulk memory or “direct-access storage” units come in many different forms, 
they may be roughly characterized by the following properties: 


5.4.9 


DISKS AND DRUMS 357 


i) Any specified part of the stored information can be accessed quickly. 

ii) Blocks of consecutive words can be transmitted rapidly between the internal 
and external memory. 

Magnetic tape satisfies (ii) but not (i), because it takes a long time to get from 
one end of a tape to the other. 

Every external memory unit has idiosyncrasies that ought to be studied 
carefully before major programs are written for it; but technology changes so 
rapidly, it is impossible to give a complete discussion here of all the available 
varieties of hardware. Therefore we shall consider only some typical memory 
devices that illustrate useful approaches to the sorting problem. 

One of the most common types of external memories satisfying (i) and (ii) is 
a disk device (see Fig. 89). Data is kept on a number of rapidly rotating circular 
disks, covered with magnetic material; a comb-like access arm, containing one 
or more “read/write heads” for each disk surface, is used to store and retrieve 
the information. Each individual surface is divided into concentric rings called 
tracks , so that an entire track of data passes a read/write head every time the 
disk completes one revolution. The access arm can move in and out, shifting 
the read/write heads from track to track; but this motion takes time. A set 
of tracks that can be read or written without repositioning the access arm is 
called a cylinder. For example, Fig. 89 illustrates a disk unit that has just one 
read/write head per surface; the light gray circles show one of the cylinders, 
consisting of all tracks currently being scanned by the read/write heads. 



To fix the ideas, let us consider hypothetical MIXTEC disk units, for which 

1 track = 5000 characters 
1 cylinder = 20 tracks 

1 disk unit = 200 cylinders 

Such a disk unit contains 20 million characters, slightly less than the amount 
of data that can be stored on a single MIXT magnetic tape. On some machines, 
tracks near the center have fewer characters than tracks near the rim; this tends 


358 SORTING 


5.4.9 


to make the programming much more complicated, and MIXTEC fortunately 
avoids such problems. (See Section 5.4.6 for a discussion of MIXT tapes. As 
in that section, we are studying classical techniques by considering machine 
characteristics that were typical of the early 1970s; modern disks are much bigger 
and faster.) 

The amount of time required to read or write on a disk device is essentially 
the sum of three quantities: 

• seek time (the time to move the access arm to the proper cylinder); 

• latency time (rotational delay until the read/write head reaches the right spot); 

• transmission time (rotational delay while the data passes the read/write head). 

On MIXTEC devices the seek time required to go from cylinder i to cylinder j is 
25+ \\i~j\ milliseconds. If i and j are randomly selected integers between 1 and 
200, the average value of \i - j | is 2( 2 ° 1 )/200 2 ss 66.7, so the average seek time is 
about 60 ms. MIXTEC disks rotate once every 25 ms, so the latency time averages 
about 12.5 ms. The transmission time for n characters is (n/5000) x 25 ms = 
5nps. (This is about 3| times as fast as the transmission rate of the MIXT tapes 
that were used in the examples of Section 5.4.6.) 

Thus the main differences between MIXTEC disks and MIXT tapes are these: 

a) Tapes can only be accessed sequentially. 

b) Individual disk operations tend to require significantly more overhead (seek 
time + latency time compared to stop/start time). 

c) The disk transmission rate is faster. 

By using clever merge patterns on tape, we were able to compensate somewhat 
for disadvantage (a). Our goal now is to think of some clever algorithms for disk 
sorting that will compensate for disadvantage (b). 

Overcoming latency time. Let us consider first the problem of mi nimi sing 
the delays caused by the fact that the disks aren’t always positioned properly 
when we want to start an I/O command. We can’t make the disk spin faster, 
but we can still apply some tricks that reduce or even eliminate all of the latency 
time. The addition of more access arms would obviously help, but that would 
be an expensive hardware modification. Here are some software ideas: 

• If we read or write several tracks of a cylinder at a time, we avoid the 
latency time (and the seek time) on all tracks but the first. In general it is often 
possible to synchronize the computing time with the disk movement in such a 
way that a sequence of input/output instructions can be carried out without 
latency delays. 

• Consider the problem of reading half a track of data (Fig. 90): If the read 
command begins when the heads are at axis A, there is no latency delay, and the 
total time for reading is just the transmission time, \ x 25 ms. If the command 
begins with the heads at B, we need | of a revolution for latency and | for 
transmission, totalling ^ x 25 ms. The most interesting case occurs when the 


5.4.9 


DISKS AND DRUMS 359 



Fig. 90. Analysis of the latency time when reading half of a track. 

heads are initially at C: With proper hardware and software we need not waste 
| of a revolution for latency delay. Reading can begin immediately, into the 
second half of the input buffer; then after a \ x 25 ms pause, reading can resume 
into the first half of the buffer, so that the instruction is completed when axis C 
is reached again. In a similar manner, we can ensure that the total latency plus 
transmission time will never exceed the time for one revolution, regardless of the 
initial position of the disk. The average amount of latency delay is reduced by 
this scheme from half a revolution to | (I — x 2 ) of a revolution, if we are reading 
or writing a given fraction a; of a track, for 0 < x < 1. When an entire track is 
being read or written (x = 1), this technique eliminates all the latency time. 

Drums: The no-seek case. Some external memory units, traditionally called 
drum memories, eliminate the seek time by having one read/write head for every 
track. If the technique of Fig. 90 is employed on such devices, both seek time 
and latency time reduce to zero, provided that we always read or write a track 
at a time; this is the ideal situation in which transmission time is the only 
limiting factor. 

Let us consider again the example application of Section 5.4.6, sorting 
100,000 records of 100 characters each, with a 100,000-character internal memory. 
The total amount of data to be sorted fills half of a MIXTEC disk. It is usually 
impossible to read and write simultaneously on a single disk unit; we shall assume 
that two disks are available, so that reading and writing can overlap each other. 
For the moment we shall assume, in fact, that the disks are actually drums, 
containing 4000 tracks of 5000 characters each, with no seek time required. 

What sorting algorithm should be used? The method of merging is a fairly 
natural choice; other methods of internal sorting do not lend themselves so well 
to a disk implementation, except for the radix techniques of Section 5.2.5. The 
considerations of Section 5.4.7 show that radix sorting is usually inferior to 
merging for general-purpose applications, because the duality theorem of that 
section applies to disks as well as to tapes. Radix sorting does have a strong 
advantage, however, when the keys are uniformly distributed and many disks 
can be used in parallel, because an initial distribution by the most significant 
digits of the keys will divide the work up into independent subproblems that 
need no further communication. (See, for example, R. C. Agarwal, SIGMOD 
Record 25,2 (June 1996), 240-246.) 


360 SORTING 


5.4.9 


We will concentrate on merge sorting in the following discussion. To begin 
a merge sort for the stated problem we can use replacement selection, with two 
5000-character input buffers and two 5000-character output buffers. In fact, it is 
possible to reduce this to three 5000-character buffers, if records in the current 
input buffer are replaced by records that come off the selection tree. That leaves 
85,000 characters (850 records) for a selection tree, so one pass over our example 
data will form about 60 initial runs. (See Eq. 5.4.6-(3).) This pass takes only 
about 50 seconds, if we assume that the internal processing time is fast enough 
to keep up with the input/output rate, with one record moving to the output 
buffer every 500 microseconds. If the input to be sorted appeared on a MIXT 
tape, instead of a drum, this pass would be slower, governed by the tape speed. 

With two drums and full-track reading/writing, it is not hard to see that 
the total transmission time for P - way merging is minimized if we let P be as 
large as possible. Unfortunately we can’t simply do a 60-way merge on all of the 
initial runs, since there isn’t room for 60 buffers in memory. (A buffer of fewer 
than 5000 characters would introduce unwanted latency time. Remember that 
we are still pretending to be living in the 1970s, when internal memory space was 
significantly limited.) If we do P - way merges, passing all the data from one drum 
to the other so that reading and writing are overlapped, the number of merge 
passes is [log P 60], so we may complete the job in two passes if 8 < P < 59. 
The smallest such P reduces the amount of internal computing, so we choose 
P — 8; if 65 initial runs had been formed we would take P - 9. If 82 or more 
initial runs had been formed, we could take P = 10, but since there is room 
for only 18 input buffers and 2 output buffers there would be a possibility of 
hangup during the merge (see Algorithm 5.4.6F); it may be better in such a case 
to do two partial passes over a small portion of the data, reducing the number 
of initial runs to 81 or less. 

Under our assumptions, both of the merging passes will take about 50 
seconds, so the entire sort in this ideal situation will be completed in just 2.5 
minutes (plus a few seconds for bookkeeping, initialization, etc.). This is six 
times faster than the best six-tape sort considered in Section 5.4.6; the reasons 
for this speedup are the improved external/internal transmission rate (3.5 times 
faster), the higher order of merge (we can’t do an eight-way tape merge unless we 
have nine or more tapes), and the fact that the output was left on disk (no final 
rewind, etc., was necessary). If the initial input and sorted output were required 
to be on MIXT tapes, with the drums used for merging only, the corresponding 
sorting time would have been about 8.2 minutes. 

If only one drum were available instead of two, the input-output time would 
take twice as long, since reading and writing must be done separately. (In fact, 
the input-output operations might take three times as long, since we would be 
overwriting the initial input data; in such a case it is prudent to follow each write 
by a read-back check operation, lest some of the input data be irretrievably 
lost, if the hardware does not provide automatic verification of written informa- 
tion.) But some of this excess time can be recovered because we can use partial 
pass methods that process some data records more often than others. The two- 


5.4.9 


DISKS AND DRUMS 361 


drum case requires all data to be processed an even number or an odd number 
of times, but the one-drum case can use more general merge patterns. 

We observed in Section 5.4.4 that merge patterns can be represented by trees, 
and that the transmission time corresponding to a merge pattern is proportional 
to the external path length of its tree. Only certain trees (T-lifo or strongly 
T-fifo) could be used as efficient tape merging patterns, because some runs get 
buried in the middle of a tape as the merging proceeds. But on disks or drums, 
all trees define usable merge patterns if the degrees of their internal nodes are 
not too large for the available internal memory size. 

Therefore we can minimize transmission time by choosing a tree with mini- 
mum external path length, such as a complete P- ary tree where P is as large as 
possible. By Eq. 5.4.4-(g), the external path length of such a tree is equal to 

q s- L(P«-S )/( P-1)J, ^[logpS], (i) 

if there are 5 external nodes (leaves). 

It is particularly easy to design an algorithm that merges according to 
the complete P- ary tree pattern. See, for example, Fig. 91, which shows the 
case P = 3, S — 6. First we add dummy runs, if necessary, to make 5=1 
(modulo P — 1); then we combine runs according to a first-in-first-out discipline, 
at every stage merging the P oldest runs at the front of the queue into a single 
run that is placed at the rear. 



The complete P- ary tree gives an optimum pattern if all of the initial runs 
are the same length, but we can often do better if some runs are longer than 
others. An optimum pattern for this general situation can be constructed without 
difficulty by using Huffman’s method (exercise 2.3.4.5-10), which may be stated 
in merging language as follows: “First add (1 — 5) mod (P - 1) dummy runs of 
length 0. Then repeatedly merge together the P shortest existing runs until only 
one run is left.” When all initial runs have the same length this method reduces 
to the FIFO discipline described above. 

In our 100,000-record example we can do nine-way merging, since 18 input 
buffers and two output buffers will fit in memory and Algorithm 5.4.6F will 
overlap all compute time. The complete 9-ary tree with 60 leaves corresponds 
to a merging pattern with 1 1| passes, if all initial runs have the same length. 
The total sorting time with one drum, using read-back check after every write, 




362 SORTING 


5.4.9 


therefore comes to about 7.4 minutes. A higher value of P may reduce this 
running time slightly; but the situation is complicated because “reading hangup” 
might occur when the buffers become too full or too empty. 

The influence of seek time. Our discussion shows that it is relatively easy to 
construct optimum merging patterns for drums, because seek time and latency 
time can be essentially nonexistent. But when disks are used with small buffers 
we often spend more time seeking information than reading it, so the seek time 
has a considerable influence on the sorting strategy. Decreasing the order of 
merge, P , makes it possible to use larger buffers, so fewer seeks are required; 
this often compensates for the extra transmission time demanded by the smaller 
value of P. 

Seek time depends on the distance traveled by the access arm, and we could 
try to arrange things so that this distance is minimized. For example, it may be 
wise to sort the records within cylinders first. However, large-scale merging 
requires a good deal of jumping around between cylinders (see exercise 2). 
Furthermore, the multiprogramming capability of modern operating systems 
means that users tend to lose control over the position of disk access arms. 
We are often justified, therefore, in assuming that each disk command involves 
a “random” seek. 

Our goal is to discover a merge pattern that achieves the best balance 
between seek time and transmission time. For this purpose we need some way 
to estimate the goodness of any particular tree with respect to a particular 
hardware configuration. Consider, for example, the tree in Fig. 92; we want to 
estimate how long it will take to carry out the corresponding merge, so that we 
can compare this tree to other trees. 

In the following discussion we shall make some simple assumptions about 
disk merging, in order to illustrate some of the general ideas. Let us suppose that 
(i) it takes 72.5 + 0.005n milliseconds to read or write n characters; (ii) 100,000 
characters of internal memory are available for working storage; (iii) an average 
of 0.004 milliseconds of computation time are required to transmit each character 
from input to output; (iv) there is to be no overlap between reading, writing, 
or computing; and (v) the buffer size used on output need not be the same as 
the buffer size used to read the data on the following pass. An analysis of the 
sorting problem under these simple assumptions will give us some insights when 
we turn to more complicated situations. 

If we do a P - way merge, we can divide the internal working storage into P+1 
buffer areas, P for input and one for output, with B = 100000/(P+1) characters 
per buffer. Suppose the files being merged contain a total of L characters; then 
we will do approximately L/B output operations and about the same number 
of input operations, so the total merging time under our assumptions will be 
approximately 

2 ( 72 ' 5 ^ + °- 005L ) + °- 004i = (0.00145P + 0.01545)L ( 2 ) 


milliseconds. 


5.4.9 


DISKS AND DRUMS 


363 



Fig. 92. A tree whose external path length is 16 and whose degree path length is 52. 


In other words, a P- way merge of L characters takes about ( aP + (3)L uni ts 
of time, for some constants a and /3 depending on the seek time, latency time, 
compute time, and memory size. This formula leads to an interesting way to 
construct good merge patterns for disks. Consider Fig. 92, for example, and 
assume that all initial runs (represented by square leaf nodes) have length L 0 . 
Then the merges at nodes 9 and 10 each take (2a + /3)(2 L 0 ) units of time, the 
merge at node 11 takes (3a + /3)(4L 0 ), and the final merge at node 12 takes 
(4a + /3)(8Lo). The total merging time therefore comes to (52a + 16/3 )L 0 units. 
The coefficient “16” here is well-known to us, it is simply the external path 
length of the tree. The coefficient “52” of a is, however, a new concept, which 
we may call the degree path length of the tree; it is the sum, taken over all leaf 
nodes, of the internal-node degrees on the path from the leaf to the root. For 
example, in Fig. 92 the degree path length is 

(2 + 4) + (2 + 4) + (3 + 4) + (2 + 3 + 4) + (2 + 3 + 4) + (3 + 4) + (4) + (4) 

= 52. 

If T is any tree, let D(T) and E(T) denote its degree path length and its 
external path length, respectively. Our analysis may be summarized as follows: 

Theorem H. If the time required to do a P-way merge on L characters has 
the form (aP + /3)L, and if there are S equal-length runs to be merged, the best 
merge pattern corresponds to a tree T for which aD(T) +/3E(T) is a minimum, 
over all trees having S leaves. | 

(This theorem was implicitly contained in an unpublished paper that George U. 
Hubbard presented at the ACM National Conference in 1963.) 

Let a and (3 be fixed constants; we shall say a tree is optimal if it has the 
minimum value of aD{T) + f3E(T) over all trees, T , with the same number of 
leaves. It is not difficult to see that all subtrees of an optimal tree are optimal, 
and therefore we can construct optimal trees with n leaves by piecing together 
optimal trees with < n leaves. 







364 SORTING 


5.4.9 


Theorem K. Let the sequence of numbers A m (n) be defined for 1 < m < n by 
the rules 


Ai(l) = 0 ; 


(3) 

Am{n)= min (A 1 (k) + A m _i(n - k)), 

1 <k<n/m 

for 2 < m < n; 

(4) 

Ai{n) = min (amn + f3n + A m (n)), 

2 <m<n ' 

for n > 2 . 

(5) 


Then Ai(n) is the minimum value of aD(T ) + fiE(T). over all trees T with 
n leaves. 

Proof. Equation ( 4 ) implies that A m (n) is the minimum value of Ai(rii) H 1 - 

Ai(n m ) taken over all positive integers n 1; . . . ,n m such that ni H 1 - n m - n. 

The result now follows by induction on n. | 

The recurrence relations ( 3 ), ( 4 ), ( 5 ) can also be used to construct the 
optimal trees themselves: Let k m (n) be a value for which the minimum occurs 
in the definition of A m (n). Then we can construct an optimal tree with n leaves 
by joining m = fci(n) subtrees at the root; the subtrees are optimal trees with 
kmipf ^m(^)) j 2 (p ^m(^) k m — \{n ^m(R)))j ••• leaves, 

respectively. 

For example, Table 1 illustrates this construction when a = /3 = 1. A com- 
pact specification of the corresponding optimal trees appears at the right of the 
table; the entry “4:9:9” when n = 22 means, for example, that an optimal tree 
722 with 22 leaves may be obtained by combining Ti,%, and 7g (see Fig. 93). 
Optimal trees are not unique; for instance, 5:8:9 would be just as good as 4:9:9. 



Fig. 93. An optimum way to merge 22 initial runs of equal length, when a = ft in 
Theorem H. This pattern minimizes the seek time, under the assumptions leading to 
Eq. (2) in the text. 

Our derivation of ( 2 ) shows that the relation a < (3 will hold whenever 
P + 1 equal buffer areas are used. The limiting case a = /?, shown in Table 1 
and Fig. 93, occurs when the seek time itself is to be minimized without regard 
to transmission time. 

Returning to our original application, we still haven’t considered how to 
get the initial runs in the first place; without read/ write/compute overlap, 
replacement selection loses some of its advantages. Perhaps we should fill the 
entire internal memory, sort it, and output the results; such input and output 


5.4.9 


DISKS AND DRUMS 365 


Table 1 

OPTIMAL TREE CHARACTERISTICS A m ( n ), k m ( n ) WHEN a = fj = 1 


m 


n 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 ' 

Tree 

n 

1 

0,0 












— 

1 

2 

6,2 

0,1 











1:1 

2 

3 

12,3 

6,1 

0,1 










1:1:1 

3 

4 

20,4 

12,1 

6,1 

0,1 









1:1:1: 1 

4 

5 

30,5 

18,2 

12,1 

6,1 

0,1 








1:1:1:1:1 

5 

6 

42,2 

24,3 

18,1 

12,1 

6,1 

0,1 







3:3 

6 

7 

52,3 

32,3 

24,1 

18,1 

12,1 

6,1 

0,1 






1:3:3 

7 

8 

62,3 

40,4 

30,2 

24,1 

18,1 

12,1 

6,1 

0,1 





2:3:3 

8 

9 

72,3 

50,4 

36,3 

30,1 

24,1 

18,1 

12,1 

6,1 

0,1 




3:3:3 

9 

10 

84,3 

60,5 

44,3 

36,1 

30,1 

24,1 

18,1 

12,1 

6,1 

0,1 



3:3:4 

10 

11 

96,3 

72,4 

52,3 

42,2 

36,1 

30,1 

24,1 

18,1 

12,1 

6,1 

0,1 


3:4:4 

11 

12 

108,3 

82,4 

60,4 

48,3 

42,1 

36,1 

30,1 

24,1 

18,1 

12,1 

6,1 

0,1 

4:4:4 

12 

13 

121,4 

92,4 

70,4 

56,3 

48,1 

42,1 

36,1 

30,1 

24,1 

18,1 

12,1 

6,1 

3:3:3:4 

13 

14 

134,4 

102,5 

80,4 

64,3 

54,2 

48,1 

42,1 

36,1 

30,1 

24,1 

18,1 

12,1 

3:3:4:4 

14 

15 

147,4 

114,5 

90,4 

72,3 

60,3 

54,1 

48,1 

42,1 

36,1 

30,1 

24,1 

18,1 

3:4:4:4 

15 

16 

160,4 

124,7 

102,4 

80,4 

68,3 

60,1 

54,1 

48,1 

42,1 

36,1 

30,1 

24,1 

4:4:4:4 

16 

17 

175,4 

134,8 

112,4 

90,4 

76,3 

66,2 

60,1 

54,1 

48,1 

42,1 

36,1 

30,1 

4:4:4:5 

17 

18 

190,4 

144,9 

122,4 

100,4 

84,3 

72,3 

66,1 

60,1 

54,1 

48,1 

42,1 

36,1 

4:4:5:5 

18 

19 

205,4 

156,9 

132,5 

110,4 

92,3 

80,3 

72,1 

66,1 

60,1 

54,1 

48,1 

42,1 

4:5:5:5 

19 

20 

220,4 

168,9 

144,4 

120,5 

100,4 

88,3 

78,2 

72,1 

66,1 

60,1 

54,1 

48,1 

5:5:5:5 

20 

21 

236,5 

180,9 

154,4 

132,4 

110,4 

96,3 

84,3 

78,1 

72,1 

66,1 

60,1 

54,1 

4:4:4:4:5 

21 

22 

252,3 

192,10 

164,4 

142,4 

120,4 

104,3 

92,3 

84,1 

78,1 

72,1 

66,1 

60,1 

4:9:9 

22 

23 

266,3 

204,11 

174,5 

152,4 

130,4 

112,3 

100,3 

90,2 

84,1 

78,1 

72,1 

66,1 

5:9:9 

23 

24 

282,3 

216,12 

186,5 

162,5 

140,4 

120,4 

108,3 

96,3 

90,1 

84,1 

78,1 

72,1 

5:9:10 

24 

25 

296,3 

229,12 

196,7 

174,4 

150,5 

130,4 

116,3 

104,3 

96,1 

90,1 

84,1 

78,1 

7:9:9 

25 


operations can each be done with one seek. Or perhaps we are better off using, 
say, 20 percent of the memory as a combination input/output buffer, and doing 
replacement selection. This requires five times as many seeks (an extra 60 
seconds or so!), but it reduces the number of initial runs from 100 to 64; the reduc- 
tion would be more dramatic if the input file were pretty much in order already. 

If we decide not to use replacement selection, the optimum tree for S = 100, 
a = 0.00145, p = 0.01545 [see ( 2 )] turns out to be rather prosaic: It is simply a 
10-way merge, completed in two passes over the data. Allowing 30 seconds for 
internal sorting (100 quicksorts, say), the initial distribution pass takes about 
2.5 minutes, and the merge passes each take almost 5 minutes, for a total of 
12.4 minutes. If we decide to use replacement selection, the optimal tree for 
S = 64 turns out to be equally uninteresting (two 8- way merge passes); the initial 
distribution pass takes about 3.5 minutes, the merge passes each take about 4.5 
minutes, and the estimated total time comes to 12.6 minutes. Remember that 
both of these methods give up virtually all read/write/compute overlap in order 
to have larger buffers, reducing seek time. None of these estimated times includes 
the time that might be necessary for read-back check operations. 

In practice the final merge pass tends to be quite different from the others; 
for example, the output is often expanded and/or written onto tape. In such 
cases the tree pattern should be chosen using a different optimality criterion at 
the root. 


366 SORTING 


5.4.9 


A closer look at optimal trees. It is interesting to examine the extreme case 
/3 = 0 in Theorems H and K, even though practical situations usually lead to 
parameters with 0 < a < /3. What tree with n leaves has the smallest possible 
degree path length? Curiously it turns out that three-way merging is best. 


Theorem L. The degree path length of a tree with n leaves is never less than 

f(n)= f3<?n + 2(n-3*), if 2 • 3*- 1 < n < 3"; 

\ 3qn + 4(n — 3 9 ), if 3« < n < 2 • 3«. [ ) 

Ternary trees T n defined by the rules 


T " n ' 

have the minimum degree path length. 



Proof. It is important to observe that / (n) is a convex function, namely that 

f(n+l)-f(n)>f(n)-f(n-l) for all n > 2. (8) 

The relevance of this property is due to the following lemma, which is dual to 
the result of exercise 2.3.4.5-17. 


Lemma C. A function g(n) defined on the positive integers satisfies 

- k )) = 9 (\n/ 2 \)+g(\n/ 2 \), n> 2, ( 9 ) 

if and only if it is convex. 

Proof. If g(n + 1) - g(n ) < g(n) - g(n - 1) for some n > 2, we have g(n + 1) + 
g(n - 1) < gin) + g(n), contradicting (9). Conversely, if (8) holds for g, and if 
1 < k < n — k, we have g(k + 1) + g(n — k — 1) < g(fc) + g(n — k) by convexity. | 
The latter part of Lemma C’s proof can be extended for any m > 2 to show 

that 


min (s(ni) + • • • + 

wiH b n m =n 

Tl\ , . . . ,Tlrn. ^ 1 

= g([n/m\) + g([(n + l)/mj ) + •■•+ g([{n + m — l)/mj) (10) 
whenever g is convex. Let 

fm(n) = f{\n/m\) + /([(n + l)/mj) + h /([(n + m - l)/mj); (11) 

the proof of Theorem L is completed by proving that / 3 (n) + 3n = f(n) and 
fm(n) + mn> }(n) for all m > 2. (See exercise 11.) | 

It would be very nice if optimal trees could always be characterized neatly 
as in Theorem L. But the results we have seen for a = /3 in Table 1 show that 
the function A\ (n) is not always convex. In fact, Table 1 is sufficient to disprove 
most simple conjectures about optimal trees! We can, however, salvage part of 
Theorem L in the general case; M. Schlumberger and J. Vuillemin have shown 
that large orders of merge can always be avoided: 


5.4.9 


DISKS AND DRUMS 367 


Theorem M. Given a and (3 as in Theorem H, there exists an optimal tree in 
which the degree of every node is at most 


[min ( k + (l + 

b+-) 

)i 

-se 

Al 

V a) 

> 


Proof. Let rii , . . . , n m be positive integers such that n x -\ \-n m — n, A(ni) + 

• • • + A(n m ) = A m (n), and n i < • • • < n m , and assume that m > d(a,/3) + 1. 
Let k be the value that minimizes (12); we shall show that 

cm(m - k) + /3n + A m _ k (n) < anm + /3n + A m (n), (13) 

hence the minimum value in (5) is always achieved for some m < d(a. B). 

By definition, since m > k + 2, we must have 

A m ~k{n) < Ai(nid hnfc + i) + j4i(nfc +2 )-| |-Ai(n m ) 

< a(ni~\ hn/ s+1 )(/c + l)+/3(n 1 H \-n k+ i) + Ai(ni)-\ \-Ai(n m ) 

= (a(A; + l)+/3)(niH \-n k +i) + A m (n) 

< (a(k + l) + (3)(k + l)n/m + A m (n), 

and (13) now follows easily. (Careful inspection of this proof shows that (12) is 
best possible, in the sense that some optimal trees must have nodes of degree 
d(ot,/3); see exercise 13.) | 

The construction in Theorem K needs 0(N 2 ) memory cells and 0(N 2 log N) 
steps to evaluate A m (n) for 1 < m < n < N; Theorem M shows that only O(N) 
cells and 0(N 2 ) steps are needed. Schlumberger and Vuillemin have discovered 
several more very interesting properties of optimal trees [Acta Informatica 3 
(1973), 25-36], Furthermore the asymptotic value of Ai(n) can be worked out 
as shown in exercise 9. 

* Another way to allocate buffers. David E. Ferguson [CACM 14 (1971), 
476-478] pointed out that seek time can be reduced if we don’t make all buffers 
the same size. The same idea occurred at about the same time to several other 
people [S. J. Waters, Comp. J. 14 (1971), 109-112; Ewing S. Walker, Software 
Age 4 (August-September, 1970), 16-17], 

Suppose we are doing a four- way merge on runs of equal length L 0 , with 
M characters of memory. If we divide the memory into equal buffers of size 
B - M/ 5, we need about L 0 /B seeks on each input file and 4 L 0 /B seeks for the 
output, totalling 8Lq/B = 40Lq/M seeks. But if we use four input buffers of 
size M/6 and one output buffer of size M/3, we need only about 4 x (6 L 0 /M) + 
4 x (3 L 0 /M) = 36 Lq/M seeks! The transmission time is the same in both cases, 
so we haven’t lost anything by the change. 

In general, suppose that we want to merge sorted files of lengths Li,...,Lp 
into a sorted file of length 


Lp + 1 — ■£'! + ••• + Lp, 


368 SORTING 


5.4.9 


and assume that a buffer of size B k is being used for the fcth file. Thus 


B i + • • • + Bp + Bp+i = M , 


(l4) 


where M is the total size of available internal memory. The number of seeks will 
be approximately 


L\ Lp Lppi 

— 1 — 4 £-±i. 

B\ Bp Bp + 1 


(15) 


Let’s try to minimize this quantity, subject to condition (14), assuming for 
convenience that the B k s don’t have to be integers. If we increase Bj by S 
and decrease B k by the same amount, the number of seeks changes by 


Bj Lj L k L k 

Bj + 8 Bj B k — 8 B k 


( Lk_ 


\B k (B k -S) Bj(Bj + 8) 


so the allocation can be improved if Lj/B? / B k /B\. Therefore we get the 
minimum number of seeks only if 



— 

~ Bp 


Lp+i 

B 2 P+1 


Since a minimum does exist it must occur when 


(x6) 


B k — \fL~ k + b \J Lp + 1), l<fe<P+l; (17) 

these are the only values of B u . . . , B P+1 that satisfy both (14) and (16). Plug- 
ging (17) into (15) gives a fairly simple formula for the total number of seeks, 

(\/Li + • • • + L P+ i Y/M, (18) 

which may be compared with the number (P + l)^ -| h Lp+i)/M obtained 

if all buffers are equal in length. By exercise 1.2.3-31, the improvement is 

E (v' Tj-VTkf/M . 

i<j<k<p+\ 


Unfortunately formula (18) does not lend itself to an easy determination of 
optimum merge patterns as in Theorem K (see exercise 14). 

The use of chaining. M. A. Goetz [ CACM 6 (1963), 245-248] has suggested 
an interesting way to avoid seek time on output, by linking individual tracks 
together. His idea requires a fairly fancy set of disk storage management routines, 
but it applies to many problems besides sorting, and it may therefore be a very 
worthwhile technique for general-purpose use. 

The concept is simple: Instead of allocating tracks sequentially within cyl- 
inders of the disk, we link them together and maintain lists of available space, 
one for each cylinder. When it is time to output a track of information, we write 
it on the current cylinder (wherever the access arm happens to be), unless that 
cylinder is full. In this way the seek time usually disappears. 


5.4.9 


DISKS AND DRUMS 369 


The catch is that we can’t store a link-to-next-track within the track itself, 
since the necessary information isn’t known at the right time. (We could store a 
link-to-previous-track and read the file backwards on the next pass, if that were 
suitable.) A table of link addresses for the tracks of each file can be maintained 
separately, because it requires comparatively little space. The available space 
lists can be represented compactly by using bit tables, with 1000 bits specifying 
the availability or unavailability of 1000 tracks. 

Forecasting revisited. Algorithm 5.4.6F shows that we can forecast which 
input buffer of a P - way merge will empty first, by looking at the last keys in 
each buffer. Therefore we can be reading and computing at the same time. 
That algorithm uses floating input buffers, not dedicated to a particular file; so 
the buffers must all be the same size, and the buffer allocation technique above 
cannot be used. But the restriction to a uniform buffer size is no great loss, since 
computers now have much larger internal memories than they used to. Nowadays 
a natural buffer size, such as the capacity of a full disk track, often suggests itself. 

Let us therefore imagine that the P runs to be merged each consist of a 
sequence of data blocks, where each block (except possibly the last) contains 
exactly B records. D. L. Whitlow and A. Sasson developed an interesting 
algorithm called SyncSort [U.S. Patent 4210961 (1980)], which improves on 
Algorithm 5.4. 6F by needing only three buffers of size B together with a memory 
pool holding PB records and PB pointers. By contrast, Algorithm 5.4.6F 
requires 2 P input buffers and 2 output buffers, but no pointers. 

SyncSort begins by reading the first block of each run and putting these PB 
records into the memory pool. Each record in the memory pool is linked to its 
successor in the run it belongs to, except that the final record in each block has 
no successor as yet. The smallest of the keys in those final records determines 
the run that will need to replenished first, so we begin to read the second block 
of that run into the first buffer. Merging begins as soon as that second block has 
been read; by looking at its final key we can accurately forecast the next relevant 
block, and we can continue in the same way to prefetch exactly the right blocks 
to input, just before they are needed. 

The three SyncSort buffers are arranged in a circle. As merging proceeds, 
the computer is processing data in the current buffer, while input is being read 
into the next buffer and output is being written from the third. The merging 
algorithm exchanges each record in the current buffer with the next record of 
output, namely the record in the memory pool that has the smallest key. The 
selection tree and the successor links are also updated appropriately as we make 
each exchange. Once the end of the current buffer is reached, we are ready to 
rotate the buffer circle: The reading buffer becomes current, the writing buffer 
is used for reading, and we' begin to write from the former current buffer. 

Many extensions of this basic idea are possible, depending on hardware 
capabilities. For example, we might use two disks, one for reading and one for 
writing, so that input and output and merging can all take place simultaneously. 
Or we might be able to overlap seek time by extending the circle to four or more 
buffers, as in Fig. 26 of Section 1.4.4, and deviating from the forecast input order. 


370 SORTING 


5.4.9 


Using several disks. Disk devices once were massive both in size and weight, 
but they became dramatically smaller, lighter, and less expensive during the 
late 1980s — although they began to hold more data than ever before. Therefore 
people began to design algorithms for once-unimaginable clusters of 5 or 10 or 
50 disk devices or for even larger disk farms. 

One easy way to gain speed with additional disks is to use the technique 
of disk striping for large files. Suppose we have D disk units, numbered 0, 1, 

. . . , D — 1, and consider a file that consists of L blocks a () ai . . . a^_ i- Striping 
this file on D disks means that we put block a,j on disk number j mod D; thus, 
disk 0 holds aoaoCi 2 D ■ ■ ■ , disk 1 holds aiau + ia 2 D+i ■ ■ ■ , etc. Then we can 
perform D reads or D writes simultaneously on D-block groups aoai . . ,a D _ 
(idcld+i ■ • ■ «2D-i, •••, which are called superblocks. The individual blocks of 
each superblock should be on corresponding cylinders on different disks so that 
the seek time will be the same on each unit. In essence, we are acting as if we 
had a single disk unit with blocks and buffers of size DB, but the input and 
output operations run up to D times faster. 

An elegant improvement on superblock striping can be used when we’re 
doing 2-way merging, or in general whenever we want to match records with 
equal keys in two files that are in order by keys. Suppose the blocks a 0 aia 2 ■ ■ ■ of 
the first file are striped on D disks as above, but the blocks 6 0 bi 6 2 . . . of the other 
file are striped in the reverse direction, with block bj on disk (D — 1 — j) mod D. 
For example, if D — 5 the blocks a 3 appear respectively on disks 0, 1, 2, 3, 4, 
0, 1, . . . , while the blocks bj for j > 0 appear on 4, 3, 2, 1, 0, 4, 3, ... . Let aj 
be the last key of block aj and let B 3 be the last key of block bj . By examining 
the a’s and B’s we can forecast the sequence in which we will want to read the 
data blocks; this sequence might, for example, be 

aob( ) a 1 a 2 bi 0304620506 0703636465 60676369610 .... 

These blocks appear respectively on disks 

04123 34201 23104 32104 ... 

when D — 5, and if we read them five at a time we will be inputting successively 
from disks {0, 4, 1, 2, 3}, {3, 4, 2, 0, 1}, {2, 3, 1, 0, 4}, {3, 2, 1, 0, 4}, . . . ; there will 
never be a conflict in which we need to read two blocks from the same disk at the 
same time! In general, with D disks we can read D at a time without conflict, 
because the first group will have k blocks ao . . . a^-i on disks 0 through k — 1 and 
D — k blocks 60 . . - bo-k-i on disks D — 1 through k, for some k; then we will be 
poised to continue in the same way but with disk numbers shifted cyclically by k. 

This trick is well known to card magicians, who call it the Gilbreath principle ; 
it was invented during the 1960s by Norman Gilbreath [see Martin Gardner, 
Mathematical Magic Show (New York: Knopf, 1977), Chapter 7; N. Gilbreath, 
Genii 52 (1989), 743-744]. We need to know the cc’s and /3’s, to decide what 
blocks should be read next, but that information takes up only a small fraction of 
the space needed by the a’s and 6’s, and it can be kept in separate files. Therefore 
we need fewer buffers to keep the input going at full speed (see exercise 23). 


5.4.9 


DISKS AND DRUMS 371 


Randomized striping. If we want to do P- way merging with D disks when 
P and D are large, we cannot keep reading the information simultaneously from 
D disks without conflict unless we have a large number of buffers, because there 
is no analog of the Gilbreath principle when P > 2. No matter how we allocate 
the blocks of a file to disks, there will be a chance that we might need to read 
many blocks into memory before we are ready to use them, because the blocks 
that we really need might all happen to reside on the same disk. 

Suppose, for example, that we want to do 8-way merging on 5 disks, and 
suppose that the blocks ao a i a 2 • • • , 606162 • • • , ■ ■ ■ , 606162 ... of 8 runs have 
been striped with aj on disk j mod D, bj on disk ( j + 1) mod D, hj on disk 
(j + ?) mod D. We might need to access these blocks in the order 

aobocodoeo fogohodiei d2e2d 3 aifi bigi < 22/263 d 4 cih\b2g2 a 3 f 3 e 4 d 3 de . . . ; ( 19 ) 

then they appear on the respective disks 

012340124001111222222333333334..., (20) 

so our best bet is to input them as follows: 

Time 1 Time 2 Time 3 Time 4 Time 5 

aob 0 c 0 d 0 e 0 fogohoCidi Gi^bihide d 3 d 3 gi 62 ? ? ai<i2<?2 ? 

Time 6 Time 7 Time 8 Time 9 
?/i/ 2 a 3 ? ??e 3 / 3 ? ??d 4 e 4 ? ? ? ? d 5 ? ( 21 ) 

By the time we are able to look at block d$, we need to have read d§ as well as 
15 blocks of future data denoted by “?”, because of congestion on disk 3. And 
we will not yet be done with the seven buffers containing remnants of a 3 , 1 ) 2 - ci, 
e 4, h, 92, and hi; so we will need buffer space for at least (16 + 8 + 5 )B input 
records in this particular example. 

The simple superblock approach to disk striping would proceed instead to 
read blocks a 0 aia 2 a 3 a 4 at time 1, 6 0 6i6 2 6 3 6 4 at time 2, . . . , 6 0 6i6 2 6 3 6 4 at time 8, 
then dsdgd^dsdg at time 9 (since d^d^dj d 3 dg is the superblock needed next), and 
so on. Using the SyncSort strategy, it would require buffers for (P + 3) DB 
records and PDB pointers in memory. The more versatile approach indicated 
above can be shown to need only about half as much buffer space; but the 
memory requirement is still approximately proportional to PDB when P and D 
are large (see exercise 24). 

R. D. Barve, E. F. Grove, and J. S. Vitter [ Parallel Computing 23 (1997), 
601—631] showed that a slight modification of the independent-block approach 
leads to an algorithm that keeps the disk input/output running at nearly its full 
speed while needing only 0(P + D log D) buffer blocks instead of fl(PD). Their 
technique of randomized striping puts block j of run k on disk ( x k + j ) mod D. 
where x k is a random integer selected just before run k is first written. Instead 
of insisting that D blocks are constantly being input, one from each disk, they 
introduced a simple mechanism for holding back when there isn’t enough space 
to keep reading ahead on certain disks, and they proved that their method is 
asymptotically optimal. 


372 SORTING 


5.4.9 


To do P - way merging on D disks with randomized striping, we can maintain 
2D + P + Q - 1 floating input buffers, each holding a block of B records. Input 
is typically being read into D of these buffers, called active read buffers, while P 
of the others contain the leading blocks from which records are currently being 
merged; these are called active merge buffers. The remaining D + Q - 1 “scratch 
buffers” are either empty or they hold prefetched data that will be needed later; 
Q is a nonnegative parameter that can be increased in order to lessen the chance 
that reading will be held back on any of the disks. 

The blocks of all runs can be arranged into chronological order as in ( 19 ): 
First we list block 0 of each run, then we list the others by determining the order 
in which active merge buffers will become empty. As explained above, this order 
is determined by the final keys in each block, so we can readily forecast which 
blocks ought to be prefetched first. 

Let’s consider example ( 19 ) again, with P = 8 , D = 5, and Q = 4. Now we 
will have only 2D + P + Q — 1 = 21 input buffer blocks to work with instead 
of the 29 that were needed above for maximum-speed reading. We will use the 
offsets 


xi = 3, x 2 = 1, x 3 = 4, X 4 = 1, x 5 = 0, xq = 4, x 7 = 2, x s — 1 ( 22 ) 

(suggested by the decimal digits of n) for runs a, b, . . . , h; thus the respective 
disks contain 
Disk Blocks 

0 : e 0 fi a 2 d 4 c 4 

1 ; bo d 0 ho e 4 / 2 a 3 d 5 

2: 9o di e 2 f>i hi f 3 d 6 . . . ( 23 ) 

3: ao d 2 g x e 3 b 2 

4: c 0 /o d 3 ai g 2 e 4 

if we list their blocks in chronological order. The “random” offsets of ( 22 ), 
together with sequential striping within each run, will tend to minimize the 
congestion of any particular chronological sequence. The actual processing now 
goes like this: 


Time 1 

Active reading 
eoMoaoCo 

Active merging 

Scratch Waiting for 
( ) ao 

Time 2 

fid Q did 2 fo 

ao 

bo Co (eo go ) 

do 

Time 3 

a 2 hoe 2 gid 3 

ao bo Co do 

e ofo 9 o(did 2 fi ) 

ho 

Time 4 

o- 2 £ibigiai 

ao fro co do eo fo go ho 

di(d 2 e 2 d 3 figia 2 — ) 

e i ( 24 ) 

Time 5 

d\f 2 h\e 3 g 2 

aobocodieifogoho 

d 2 e 2 d 3 aifibigia 2 ( ) 

/2 

Time 6 

cia 3 f 3 b 2 e 4 

a 2 b\Cod 3 e 2 f 2 giho 

e 3 d 4 (hig 2 ) 

Cl 

Time 7 

? d 5 d 6 ? ? 

a 2 bi Cid 4 e 3 f 2 gih 0 

h\b 2 g 2 a 3 f 3 e 4 ( ) 

d 5 


At each unit of time we are waiting for the chronologically first block that is 
not yet merged and not yet in a scratch buffer; this is one of the blocks that is 
currently being input to an active read buffer. We assume that the computer 
is much faster than the disks; thus, all blocks before the one we are waiting for 
will have already entered the merging process before input is complete. We also 


5.4.9 


DISKS AND DRUMS 373 


assume that sufficient output buffers are available so that merging will not be 
delayed by the lack of a place to place the output (see exercise 26). When a round 
of input is complete, the block we were waiting for is immediately classified as an 
active merge buffer, and the empty merge buffer it replaces will be used for the 
next active reading. The other D—l active read buffers now trade places with the 
D — l least important scratch buffers; scratch buffers are ranked by chronological 
order of their contents. On the next round we will wait for the first unmerged 
block that isn’t present in the scratch buffers. Any scratch buffers preceding that 
block in chronological order will become part of the active merge before the next 
input cycle, but the others — shown in parentheses above — will be carried over 
and they will remain as scratch buffers on the next round. However, at most Q 
of the buffers in parentheses can be carried over, because we will need to convert 
D — l scratch buffers to active read status immediately after the input is ready. 
Any additional scratch buffers are effectively blanked out, as if they hadn’t been 
read. This blanking-out occurs at Time 4 in (24): We cannot carry all six of 
the blocks d^e^d^figia^ over to Time 5, because Q = 4, so we reread 51 and a^. 
Otherwise the reading operations in this example take place at full speed. 

Exercise 29 proves that, given any chronological sequence of runs to be 
merged, the method of randomized striping will achieve the minimum number 
of disk reads within a factor of r(D, Q - f 2), on the average, where the function r 
is tabulated in Table 2. For example, if D = 4 and Q = 18, the average time 
to do a P - way merge on L blocks of data with 4 disks and P + 25 input buffers 
will be at most the time to read r(4, 20 )L/D 1.785L/4 blocks on a single disk. 
This theoretical upper bound is quite conservative; in practice the performance 
is even better, very near the optimum time L/4. 


Table 2 

GUARANTEES ON THE PERFORMANCE OF RANDOMIZED STRIPING 




r(d, d) 

r(d, 2d) 

r(d , 3d) 

r(d , 4 d) 

r(d, 5 d) 

r(d, 6 d) 

r(d, 7 d) 

r(cf, 8 d) 

r(d, 9 d) 

r(d, 10 d) 

d = 

2 

1.500 

1.500 

1.499 

1.467 

1.444 

1.422 

1.393 

1.370 

1.353 

1.339 

d = 

4 

2.460 

2.190 

1.986 

1.888 

1.785 

1.724 

1.683 

1.633 

1.597 

1.570 

d = 

8 

3.328 

2.698 

2.365 

2.183 

2.056 

1.969 

1.889 

1.836 

1.787 

1.743 

d = 

16 

4.087 

3.103 

2.662 

2.434 

2.277 

2.156 

2.067 

1.997 

1.933 

1.890 

d = 

32 

4.503 

3.392 

2.917 

2.654 

2.458 

2.319 

2.218 

2.130 

2.062 

2.005 

d = 

64 

5.175 

3.718 

3.165 

2.847 

2.613 

2.465 

2.346 

2.249 

2.174 

2.107 

d = 

128 

5.431 

3.972 

3.356 

2.992 

2.759 

2.603 

2.459 

2.358 

2.273 

2.201 

d = 

256 

5.909 

4.222 

3.536 

3.155 

2.910 

2.714 

2.567 

2.464 

2.363 

2.289 

d = 

512 

6.278 

4.455 

3.747 

3.316 

3.024 

2.820 

2.675 

2.556 

2.450 

2.375 

d = 

1024 

6.567 

4.689 

3.879 

3.434 

3.142 

2.937 

2.780 

2.639 

2.536 

2.452 


Will keysorting help? When records are long and keys are short, it is very 
tempting to create a new file consisting simply of the keys together with a serial 
number specifying their original file location. After sorting this key file, we can 
replace the keys by the successive numbers 1,2 ,...; the new file can then be 
sorted by original file location and we will have a convenient specification of how 
to unshuffle the records for the final rearrangement. Schematically, the process 


374 SORTING 


has the following form: 


i) 

Original file 

(K 1 ,I 1 )(K 2 ,I 2 ).. 

■ {Kn, In) 

long 

ii) 

Key file 

(K u 1)(K 2 ,2).. 

(K n ,N) 

short 

iii) 

Sorted (ii) 

( K pi,Pi){K P2 ,P2 ).. 

(Kpn’Pn) 

short 

iv) 

Edited (iii) 

(l,Pi)(2,p 2 ).. 

( N, Pn ) 

short 

v) 

Sorted (iv) 

(9i,1)(92,2).. 

C q N ,N ) 

short 

vi) 

Edited (i) 

{qii h){q2, h) ■ ■ 

{qN i In) 

long 


Here pj = k if and only if q k = j. The two sorting processes in (iii) and (v) are 
comparatively fast (perhaps even internal sorts), since the records aren’t very 
long. In stage (vi) we have reduced the problem to sorting a file whose keys are 
simply the numbers (1,2,..., IV}; each record now specifies exactly where it is 
to be moved. 

The external rearrangement problem that remains after stage (vi) seems 
trivial, at first glance; but in fact it is rather difficult, and no really good 
algorithms (significantly better than sorting) have yet been found. We could 
obviously do the rearrangement in N steps, moving one record at a time; for 
large enough N this is better than the N log IV of a sorting method. But N is 
never that large; N is, however, sufficiently large that N seeks are unthinkable. 

A radix sorting method can be used efficiently on the edited records of (vi), 
since their keys have a perfectly uniform distribution. On modern computers, the 
processing time for an eight-way distribution is much faster than the processing 
time for an eight-way merge; hence a distribution sort is probably the best 
procedure. (See Section 5.4.7, and see also exercise 19.) 

On the other hand, it seems wasteful to do a full sort after the keys have 
already been sorted. One reason the external rearrangement problem is unex- 
pectedly difficult has been discovered by R. W. Floyd, who found a nontrivial 
lower bound on the number of seeks required to rearrange records on a disk device 
[Complexity of Computer Computations (New York: Plenum, 1972), 105-109], 

It is convenient to describe Floyd’s result in terms of the elevator problem of 
Section 5.4.8; but this time we want to find an elevator schedule that minimizes 
the number of stops, instead of minimizing the distance traveled. Mini miz ing 
the number of stops is not precisely equivalent to finding the minimum-seek 
rearrangement algorithm, since a stop combines input to the elevator with output 
from the elevator; but the stop-minimization criterion is close enough to indicate 
the basic ideas. 

We shall make use of the “discrete entropy” function 

F( n ) = + l) = B(n) + n - 1 = nflgn] - 2^ Ign 1 + n, (25) 

l<k<n 

where B(n) is the binary insertion function, Eq. 5. 3. 1^(3). By Eq. 5. 3. 1^(34), 
F(n) is the minimum external path length of a binary tree with n leaves, and 

nlgn < F(n) < nlgn + 0.0861n. (26) 


5.4.9 


DISKS AND DRUMS 375 


Since F(n) is convex and satisfies F(n) = n + F(|_n/2j) + F(\n/ 2]), we know 
by Lemma C above that 

F(n) < F(k) + F(n — k) + n, for 0 < A; < n. (27) 

This relation is also evident from the external path length characterization of F; 
it is the crucial fact we need in the following argument. 

As in Section 5.4.8 we shall assume that each floor holds b people, the 
elevator holds m people, and there are n floors. Let Sij be the number of people 
currently on floor i whose destination is floor j. The togetherness rating of any 
configuration of people in the building is defined to be the sum ^i<i j< n F( s ij)- 
For example, assume that b — m = n = 6 and that the 36 people are initially 
scattered among the floors as follows: 

uuuuuu . . 

123456 123456 123456 123456 123456 123456 

The elevator is empty, sitting on floor 1; “u” denotes a vacant position. Each 
floor contains one person with each possible destination, so all s^ are 1 and the 
togetherness rating is zero. If the elevator now transports six people to floor 2, 
we have the configuration 

123456 , 

uuuuuu 123456 123456 123456 123456 123456 v 2 9) 

and the togetherness rating becomes 6F(0) + 24F(1) + 6F(2) = 12. Suppose the 
elevator now carries 1, 1, 2, 3, 3, and 4 to floor 3: 

              112334
uuuuuu 245566 123456 123456 123456 123456                        (30)

The togetherness rating has jumped to 4F(2) + 2F(3) = 18. When all people 
have finally been transported to their destinations, the togetherness rating will 
be 6F(6) = 96. 
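These ratings can be verified mechanically; the following sketch (Python is ours) represents a configuration by the multiset of destination numbers on each floor and recomputes the four values 0, 12, 18, and 96:

from math import ceil, log2
from collections import Counter

def F(n):
    # discrete entropy function (25)
    if n <= 1:
        return 0
    c = ceil(log2(n))
    return n * c - 2 ** c + n

def rating(floors):
    # sum of F(s_ij) over all floors i and destinations j
    return sum(F(s) for floor in floors for s in Counter(floor).values())

start  = [[1, 2, 3, 4, 5, 6]] * 6                                    # (28)
after1 = [[], [1, 2, 3, 4, 5, 6] * 2] + [[1, 2, 3, 4, 5, 6]] * 4     # (29)
after2 = ([[], [2, 4, 5, 5, 6, 6], [1, 1, 2, 3, 3, 4] + [1, 2, 3, 4, 5, 6]]
          + [[1, 2, 3, 4, 5, 6]] * 3)                                # (30)
done   = [[i + 1] * 6 for i in range(6)]         # everyone at home

assert [rating(c) for c in (start, after1, after2, done)] == [0, 12, 18, 96]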

Floyd observed that the togetherness rating can never increase by more than b + m at each stop, since a set of s equal-destination people joining with a similar set of size s′ improves the rating by F(s + s′) − F(s) − F(s′) ≤ s + s′, by (27). Therefore we have the following result.

Theorem F. Let t be the togetherness rating of an initial configuration of bn people, in terms of the definitions above. The elevator must make at least

$$\lceil (F(b)n - t)/(b + m) \rceil$$

stops in order to bring them all to their destinations. ∎
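For the six-floor example above the bound is easy to evaluate: b = m = n = 6, F(6) = 16, and the initial rating of (28) is t = 0, so at least ⌈96/12⌉ = 8 stops are needed (exercise 16 exhibits a schedule using 12). A one-line check (Python is ours):

from math import ceil

b = m = n = 6
F6 = 6 * 3 - 2 ** 3 + 6                      # F(6) = 16, by (25)
assert ceil((F6 * n - 0) / (b + m)) == 8     # Theorem F with t = 0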

Translating this result into disk terminology, let there be bn records, with b per block, and suppose the internal memory can hold m records at a time. Every disk read brings one block into memory, every disk write stores one block, and s_ij is the number of records in block i that belong in block j. If n > b, there are initial configurations in which all the s_ij are ≤ 1; so t = 0 and at least F(b)n/(b + m) ≈ (bn lg b)/m block-reading operations are necessary to rearrange




the records. (The factor lg b makes this lower bound nontrivial when b is large.) Exercise 17 derives a substantially stronger lower bound for the common case that m is substantially larger than b.

EXERCISES 

1. [M22] The text explains a method by which the average latency time required to read a fraction x of a track is reduced from 1/2 to (1/2)(1 − x²) revolutions. This is the minimum possible value, when there is one access arm. What is the corresponding minimum average latency time if there are two access arms, 180° apart, assuming that only one arm can transmit data at any one time?

2 . [M30] (A. G. Konheim.) The purpose of this problem is to investigate how far the 
access arm of a disk must move while merging files that are allocated “orthogonally” 
to the cylinders. Suppose there are P files, each containing L blocks of records, and 
assume that the first block of each file appears on cylinder 1, the second on cylinder 2, 
etc. The relative order of the last keys in each block governs the access arm motion 
during the merge, hence we may represent the situation in the following mathematically 
tractable way: Consider a set of PL ordered pairs

$$\begin{matrix}
(a_{11},1) & (a_{21},1) & \cdots & (a_{P1},1)\\
(a_{12},2) & (a_{22},2) & \cdots & (a_{P2},2)\\
\vdots & \vdots & & \vdots\\
(a_{1L},L) & (a_{2L},L) & \cdots & (a_{PL},L)
\end{matrix}$$

where the set $\{a_{ij} \mid 1\le i\le P,\ 1\le j\le L\}$ consists of the numbers $\{1, 2, \ldots, PL\}$ in some order, and where $a_{ij} < a_{i(j+1)}$ for $1 \le j < L$. (Rows represent cylinders, columns represent input files.) Sort the pairs on their first components and let the resulting sequence be $(1, j_1)(2, j_2)\ldots(PL, j_{PL})$. Show that, if each of the $(PL)!/L!^P$ choices of the $a_{ij}$ is equally likely, the average value of

$$|j_2 - j_1| + |j_3 - j_2| + \cdots + |j_{PL} - j_{PL-1}|$$

is

$$(L-1)\left(1 + (P-1)\,2^{2L-1}\Big/\binom{2L}{L}\right).$$

[Hint: See exercise 5.2.1-14.] Notice that as $L \to \infty$ this value is asymptotically equal to $\frac{1}{2}(P-1)L\sqrt{\pi L} + O(PL)$.

3. [M15] Suppose the internal memory is limited so that 10-way merging is not feasible. How can recurrence relations (3), (4), (5) be modified so that A_9(n) is the minimum value of αD(T) + βE(T), over all n-leaved trees T having no internal nodes of degree greater than 9?

► 4. [M21] Consider a modified form of the square root buffer allocation scheme, in 
which all P of the input buffers have equal length, but the output buffer size should 
be chosen so as to minimize seek time. 

a) Derive a formula corresponding to (2), for the running time of an L-character 
P-way merge. 

b) Show that the construction in Theorem K can be modified in order to obtain a 
merge pattern that is optimal according to your formula from part (a). 




5. [M20] When two disks are being used, so that reading on one is overlapped with 
writing on the other, we cannot use merge patterns like that of Fig. 93 since some leaves 
are at even levels and some are at odd levels. Show how to modify the construction of 
Theorem K in order to produce trees that are optimal subject to the constraint that 
all leaves appear on even levels or all on odd levels. 

► 6. [22] Find a tree that is optimum in the sense of exercise 5, when n = 23 and α = β = 1. (You may wish to use a computer.)

► 7. [M24] When the initial runs are not all the same length, the best merge pattern (in the sense of Theorem H) minimizes αD(T) + βE(T), where D(T) and E(T) now represent weighted path lengths: Weights w_1, ..., w_n (corresponding to the lengths of the initial runs) are attached to each leaf of the tree, and the degree sums and path lengths are multiplied by the appropriate weights. For example, if T is the tree of Fig. 92, we would have D(T) = 6w_1 + 6w_2 + 7w_3 + 9w_4 + 9w_5 + 7w_6 + 4w_7 + 4w_8, E(T) = 2w_1 + 2w_2 + 2w_3 + 3w_4 + 3w_5 + 2w_6 + w_7 + w_8.

Prove that there is always an optimal pattern in which the shortest k runs are merged first, for some k.

8. [49] Is there an algorithm that finds optimal trees for given α, β and weights w_1, ..., w_n, in the sense of exercise 7, taking only O(n^c) steps for some c?

9. [HM39] (L. Hyafil, F. Prusker, J. Vuillemin.) Prove that, for fixed α and β,

$$A_1(n) = \left(\min_{m\ge 2}\frac{\alpha m + \beta}{\log m}\right) n\log n + O(n)$$

as $n \to \infty$, where the O(n) term is ≥ 0.

10. [HM44] (L. Hyafil, F. Prusker, J. Vuillemin.) Prove that when α and β are fixed, A_1(n) = αmn + βn + A_m(n) for all sufficiently large n, if m minimizes the coefficient in exercise 9.

11. [M29] In the notation of (6) and (11), prove that f_m(n) + mn ≥ f(n) for all m ≥ 2 and n ≥ 2, and determine all m and n for which equality holds.

12. [25] Prove that, for all n > 0, there is a tree with n leaves and minimum degree path length (6), with all leaves at the same level.

13. [M24] Show that for 2 ≤ n ≤ d(α, β), where d(α, β) is defined in (12), the unique best merge pattern in the sense of Theorem H is an n-way merge.

14. [40] Using the square root method of buffer allocation, the seek time for the merge pattern in Fig. 92 would be proportional to $(\sqrt{1}+\sqrt{1}+\sqrt{2}+\sqrt{4}+\sqrt{8})^2 + (\sqrt{1}+\sqrt{1}+\sqrt{2})^2 + (\sqrt{1}+\sqrt{1}+\sqrt{2}+\sqrt{4})^2 + (\sqrt{1}+\sqrt{1}+\sqrt{2})^2$; this is the sum, over each internal node, of $(\sqrt{n_1} + \cdots + \sqrt{n_m} + \sqrt{n_1+\cdots+n_m}\,)^2$, where that node's respective subtrees have $(n_1, \ldots, n_m)$ leaves. Write a computer program that generates minimum-seek-time trees having 1, 2, 3, ... leaves, based on this formula.

15. [M22] Show that Theorem F can be improved slightly if the elevator is initially empty and if F(b)n ≠ t: At least ⌈(F(b)n + m − t)/(b + m)⌉ stops are necessary in such a case.

16. [23] (R. W. Floyd.) Find an elevator schedule that transports all the people of (28) to their destinations in at most 12 stops. (Configuration (29) shows the situation after one stop, not two.)




► 17. [HM28] (R. W. Floyd, 1980.) Show that the lower bound of Theorem F can be improved to

$$\frac{n(b\ln n - \ln b - 1)}{\ln n + b(1 + \ln(1 + m/b))},$$

in the sense that some initial configuration must require at least this many stops. [Hint: Count the configurations that can be obtained after s stops.]

18. [HM26] Let L be the lower bound of exercise 17. Show that the average number of elevator stops needed to take all people to their desired floors is at least L − 1, when the (bn)! possible permutations of people into bn desks are equally likely.

► 19. [25] (B. T. Bennett and A. C. McKellar.) Consider the following approach to keysorting, illustrated on an example file with 10 keys:

i) Original file: (50, I_0)(08, I_1)(51, I_2)(06, I_3)(90, I_4)(17, I_5)(89, I_6)(27, I_7)(65, I_8)(42, I_9)

ii) Key file: (50, 0)(08, 1)(51, 2)(06, 3)(90, 4)(17, 5)(89, 6)(27, 7)(65, 8)(42, 9)

iii) Sorted (ii): (06, 3)(08, 1)(17, 5)(27, 7)(42, 9)(50, 0)(51, 2)(65, 8)(89, 6)(90, 4)

iv) Bin assignments (see below): (2, 1)(2, 3)(2, 5)(2, 7)(2, 8)(2, 9)(1, 0)(1, 2)(1, 4)(1, 6)

v) Sorted (iv): (1, 0)(2, 1)(1, 2)(2, 3)(1, 4)(2, 5)(1, 6)(2, 7)(2, 8)(2, 9)

vi) (i) distributed into bins using (v):

Bin 1: (50, I_0)(51, I_2)(90, I_4)(89, I_6)
Bin 2: (08, I_1)(06, I_3)(17, I_5)(27, I_7)(65, I_8)(42, I_9)

vii) The result of replacement selection, reading first bin 2, then bin 1:

(06, I_3)(08, I_1)(17, I_5)(27, I_7)(42, I_9)(50, I_0)(51, I_2)(65, I_8)(89, I_6)(90, I_4)

The assignment of bin numbers in step (iv) is made by doing replacement selection on (iii), from right to left, in decreasing order of the second component. The bin number is the run number. The example above uses replacement selection with only two elements in the selection tree; the same size tree should be used for replacement selection in both (iv) and (vii). Notice that the bin contents are not necessarily in sorted order!

Prove that this method will sort, namely that the replacement selection in (vii) will produce only one run. (This technique reduces the number of bins needed in a conventional keysort by distribution, especially if the input is largely in order already.)
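The whole procedure can be traced mechanically. Here is a sketch (Python and the heap-based coding are ours; the heap plays the role of the two-element selection tree) that reproduces steps (ii)-(vii) on the example file and checks that step (vii) yields a single run:

import heapq
from itertools import islice

def replacement_selection(items, m, descending=False):
    # Yield (run, item) pairs: replacement selection with an m-item tree.
    # Items are compared on item[0]; runs are decreasing when descending=True.
    s = -1 if descending else 1
    it = iter(items)
    heap = [(1, s * x[0], x) for x in islice(it, m)]
    heapq.heapify(heap)
    while heap:
        run, key, x = heapq.heappop(heap)
        yield run, x
        nxt = next(it, None)
        if nxt is not None:
            nkey = s * nxt[0]
            # an item that cannot extend the current run waits for the next run
            heapq.heappush(heap, (run + (nkey < key), nkey, nxt))

recs = [(50, 'I0'), (8, 'I1'), (51, 'I2'), (6, 'I3'), (90, 'I4'),
        (17, 'I5'), (89, 'I6'), (27, 'I7'), (65, 'I8'), (42, 'I9')]   # (i)

key_file = sorted((k, pos) for pos, (k, _) in enumerate(recs))        # (ii), (iii)

# (iv), (v): replacement selection over (iii) from right to left, in
# decreasing order of the position field; the run number is the bin number.
bin_of = {}
for run, (pos, _) in replacement_selection(
        [(pos, k) for k, pos in reversed(key_file)], 2, descending=True):
    bin_of[pos] = run

bins = {}                                                             # (vi)
for pos, rec in enumerate(recs):
    bins.setdefault(bin_of[pos], []).append(rec)

# (vii): read the highest-numbered bin first, then down to bin 1.
stream = [rec for b in sorted(bins, reverse=True) for rec in bins[b]]
result = list(replacement_selection(stream, 2))
assert all(run == 1 for run, _ in result)    # a single run: the output is sorted
assert [rec for _, rec in result] == sorted(recs)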

► 20. [25] Modern hardware/software systems provide programmers with a virtual memory: Programs are written as if there were a very large internal memory, able to contain all of the data. This memory is divided into pages, only a few of which are in the actual internal memory at any one time; the others are on disks or drums. Programmers need not concern themselves with such details, since the system takes care of everything; new pages are automatically brought into memory when needed.

It would seem that the advent of virtual memory technology makes external sorting 
methods obsolete, since the job can simply be done using the techniques developed for 
internal sorting. Discuss this situation; in what ways might a hand-tailored external 
sorting method be better than the application of a general-purpose paging technique 
to an internal sorting method? 

► 21. [M15] How many blocks of an L-block file go on disk j when the file is striped on D disks?

22. [22] If you are merging two files with the Gilbreath principle and you want to store the keys α_j with the a blocks and the keys β_j with the b blocks, in which block should α_j be placed in order to have the information available when it is needed?




► 23 . [20] How much space is needed for input buffers to keep input going continuously 
when two-way merging is done by (a) superblock striping? (b) the Gilbreath principle? 

24. [M36] Suppose P runs have been striped on D disks so that block j of run k appears on disk (x_k + j) mod D. A P-way merge will read those blocks in some chronological order such as (19). If groups of D blocks are to be input continuously, we will read at time t the chronologically tth block stored on each disk, as in (21). What is the minimum number of buffer records needed in memory to hold input data that has not yet been merged, regardless of the chronological order? Explain how to choose the offsets x_1, x_2, ..., x_P so that the fewest buffers are needed in the worst case.

25. [23] Rework the text's example of randomized striping for the case Q = 3 instead of Q = 4. What buffer contents would occur in